반정형 데이터 - XML · 공부를 합시다

빅데이터 청년인재 Day 8

XML

XML이란

Markup(열고 닫는) language
데이터를 저장하고 전송 할 수 있다.
human and machine readble

왜 XML을 사용하는가

어디서나 공용으로 사용할 수 있는 structing data
소프트웨어와 하드웨어로부터 독립적
잘 짜여진 구조

특징

태그가 미리 정의되어 있지 않음
- 따라서 사용자들끼리 명시해야한다
텍스트 기반
- 더 읽기 쉽고 문서 작성도 쉽고 디버깅도 쉽다
확장 가능성
- 객첸체나 계층, 관계 같은 풍부한 구조를 지원한다
- 대부분의 XML 어플리케이션은 새로운 데이터가 추가되거나 삭제되도 예상대로 작동한다.
validity

Well-formed Documents

올바른 syntax로 쓰여진 XML 문서를 Well Formed 라 한다.
루트 노드를 반드시 가지고 있다.
여는 태그와 닫는 태그를 반드시 가지고 있다.
대소문자를 구분한다.
요소들은 반드시 올바르게 중첩되어 있어야한다.
어트리뷰트들은 반드시 ““에 쌓여 있어야 한다.

Valid Documents

Well Formed XML 문서는 valid XML 문서와 같지 않다.
- DTD - The original Document Type Definition
- XML Schema - An XML-based alternative to DTD

XML with Python

공식 문서 : 여기
라이브러리
- xml : import xml.etree.ElementTree as et
- lxml : from lxml import etree

1) XML

(1) 만들기

# bookstore부터 시작하는 DOM
bookStore = et.Element("bookstore")

book1 = et.Element("book", category="cooking")
bookStore.insert(0, book1)

title1 = et.Element("title")
title1.attrib["lang"] = "en"
title1.text = "Everyday Italian"
book1.append(title1)

# 노드에서 sub element 생성
et.SubElement(book1, "author").text = "Giada De Laurentiis"
et.SubElement(book1, "year").text = "2005"
et.SubElement(book1, "price").text = "30.00"

book2 = et.Element("book", {"category":"children"})
bookStore.append(book2)

title2 = et.Element("title")
title2.attrib["lang"] = title1.get("lang")
title2.text = "Harry Potter"
book2.append(title2)

et.SubElement(book2, "author").text = "Giada De Laurentiis"
et.SubElement(book2, "year").text = "2005"
et.SubElement(book2, "price").text = "30.00"

et.dump(bookStore)

# output 

<bookstore><book category="cooking"><title lang="en">Everyday Italian</title><author>Giada De Laurentiis</author><year>2005</year><price>30.00</price></book><book category="children"><title lang="en">Harry Potter</title><author>Giada De Laurentiis</author><year>2005</year><price>30.00</price></book></bookstore>

(2) parsing

root = et.fromstring(et.tostring(bookStore))

#self.children, list(elem)
childNodes = root.getchildren()
print(len(childNodes))
for childNode in childNodes:
    print(childNode.tag, childNode.items())
    
for childNode in childNodes[0]:
    print(childNode.tag, childNode.keys())
    if childNode.keys() != []:
        print([childNode.get(k) for k in childNode.keys()])
        
        
book = root.find("book")
print(book.tag, book.get("category"))

bookList = root.findall("book")
for book in bookList:
    print(book.tag, book.get("category"))
    
# .은 현재위치 - 여기서는 루트이다.
# 즉 루트 아래 자식요소 중에 타이틀인거 찾아라
title = root.find(".//title")
print(type(title), title.text)

titleList = root.findall(".//title")
print([title.text for title in titleList])

title = root.findtext(".//title")
print(type(title), title)

book = root.find(".//book[@category='children']")
print(book, book.tag)

# output 

2
book [('category', 'cooking')]
book [('category', 'children')]
title ['lang']
['en']
author []
year []
price []
book cooking
book cooking
book children
<class 'xml.etree.ElementTree.Element'> Everyday Italian
['Everyday Italian', 'Harry Potter']
<class 'str'> Everyday Italian
<Element 'book' at 0x10dd81bd8> book

(3) xml 파일

write

from xml.etree.ElementTree import ElementTree

tree = ElementTree(root)
tree.write("book_xml.xml", encoding="utf-8", xml_declaration="utf-8")

parse

from xml.etree.ElementTree import parse

tree = parse("book_xml.xml")
root = tree.getroot()

for node in root.iter():
    print(node.tag, node.text)

# output

bookstore None
book None
title Everyday Italian
author Giada De Laurentiis
year 2005
price 30.00
book None
title Harry Potter
author Giada De Laurentiis
year 2005
price 30.00

ElementTree

tree = ElementTree(file="book_xml.xml")
root = tree.getroot()

for node in root.iter():
    print(node.tag, node.text)

bookstore None
book None
title Everyday Italian
author Giada De Laurentiis
year 2005
price 30.00
book None
title Harry Potter
author Giada De Laurentiis
year 2005
price 30.00

2) lXML

(1) 만들기

from lxml import etree

bookStore = etree.Element("bookstore")

book1 = etree.SubElement(bookStore, "book")
book2 = etree.SubElement(bookStore, "book", attrib={"category":"children"})

book1.attrib["category"] = "cooking"

title1 = etree.Element("title", lang="en")
title1.text = "Everyday Italian"
book1.append(title1)

# 노드에서 sub element 생성
etree.SubElement(book1, "author").text = "Giada De Laurentiis"
etree.SubElement(book1, "year").text = "2005"
etree.SubElement(book1, "price").text = "30.00"

title2 = etree.Element("title")
title2.set("lang", title1.get("lang"))
title2.text = "Harry Potter"
book2.append(title2)

etree.SubElement(book2, "author").text = "Giada De Laurentiis"
etree.SubElement(book2, "year").text = "2005"
book2.insert(3, etree.Element("price"))

print(len(book2))
book2[-1].text = "30.00"

xmlBytes = etree.tostring(bookStore, encoding="utf-8", pretty_print=True, xml_declaration=True)
xmlStr = etree.tounicode(bookStore, pretty_print=True)
print(type(xmlBytes), type(xmlStr))
etree.dump(bookStore)

# output 

<class 'bytes'> <class 'str'>
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
</bookstore>

ElementTree

xml = etree.XML(etree.tostring(bookStore))
xmlTree = etree.ElementTree(xml)
xmlRoot = xmlTree.getroot()

print(xmlTree.docinfo.xml_version)
print(xmlTree.docinfo.encoding)
print(xmlTree.docinfo.doctype)
print(xmlTree.docinfo.root_name)

print(len(xmlRoot))
for childNode in xmlRoot:
    print(childNode.tag, childNode.attrib)

# output

1.0
UTF-8

bookstore
2
book {'category': 'cooking'}
book {'category': 'children'}

xml = etree.fromstring(etree.tostring(bookStore))
xmlTree = etree.ElementTree(xml)
xmlRoot = xmlTree.getroot()

childNodes = xmlRoot.getchildren()

print(len(childNodes))
for childNode in childNodes:
    print(childNode.tag, childNode.items())
    
for childNode in childNodes[0]:
    print(childNode.tag, childNode.keys())
    if childNode.keys() != []:
        print([childNode.get(k) for k in childNode.keys()])
        
book = xmlRoot.find("book")
print(book.tag, book.get("category"))

bookList = xmlRoot.findall("book")
for book in bookList:
    print(book.tag, book.get("category"))
    
title = xmlRoot.find(".//title")
print(type(title), title.text)

titleList = xmlRoot.findall(".//title")
print([title.text for title in titleList])

title = xmlRoot.findtext(".//title")
print(type(title), title)

book = xmlRoot.find(".//book[@category='children']")
print(book, book.tag)

for cilldNode in xmlRoot.iter():
    print(childNode.tag, childNode.text)
    
for childNode in xmlRoot.iter("book"):
    print(childNode.tag, childNode.text)

# output

2
book [('category', 'cooking')]
book [('category', 'children')]
title ['lang']
['en']
author []
year []
price []
book cooking
book cooking
book children
<class 'lxml.etree._Element'> Everyday Italian
['Everyday Italian', 'Harry Potter']
<class 'str'> Everyday Italian
<Element book at 0x10dd23b08> book
price 30.00
price 30.00
price 30.00
price 30.00
price 30.00
price 30.00
price 30.00
price 30.00
price 30.00
price 30.00
price 30.00
book None
book None

write & parse

xmlTree.write("book_tree.xml")
etree.ElementTree(xmlRoot).write("book_root.xml")

xmlTree = etree.parse("book_tree.xml")
xmlRoot = xmlTree.getroot()

etree.dump(xmlRoot)

xmlTree = etree.parse("book_root.xml")
xmlRoot = xmlTree.getroot()

etree.dump(xmlRoot)

# output

<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
</bookstore>
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
</bookstore>

공공데이터 open API로 데이터 가져오기

import urllib
from lxml import etree

# 서비스 url
url = "http://openapi.airkorea.or.kr/openapi/services/rest/ArpltnInforInqireSvc/getMsrstnAcctoRltmMesureDnsty"

# 필요한 파라미터
params = {
    "serviceKey" : "개인 키",
    "numOfRows" : 10,
    "pageSize" : 10,
    "pageNo" : 1,
    "startPage" : 1,
    "stationName" : "노원구",
    "dataTerm" : "DAILY",
    "ver" : "1.3",
}

params["serviceKey"] = urllib.parse.unquote(params["serviceKey"])
params = urllib.parse.urlencode(params)
params = params.encode("utf-8")

# get 방식으로 url 날림
req = urllib.request.Request(url, data=params)
# request 객체 만들어서 보냄.
res = urllib.request.urlopen(req)

resStr = res.read()

xmlObj = etree.fromstring(resStr)
xmlRoot = etree.ElementTree(xmlObj).getroot()

pm25List = xmlRoot.findall(".//item/pm25Value")
for item in pm25List:
    print(item.tag, item.text)

# output

pm25Value 1
pm25Value 2
pm25Value 3
pm25Value 3
pm25Value 3
pm25Value 4
pm25Value 3
pm25Value 2
pm25Value -
pm25Value -

공부하자

반정형 데이터 - XML

빅데이터 청년인재 Day 8

목차

XML

XML이란

왜 XML을 사용하는가

특징

Well-formed Documents

Valid Documents

XML with Python

1) XML

(1) 만들기

(2) parsing

(3) xml 파일

2) lXML

(1) 만들기

공공데이터 open API로 데이터 가져오기

Comments

공부하자

반정형 데이터 - XML

빅데이터 청년인재 Day 8

목차

XML

XML이란

왜 XML을 사용하는가

특징

Well-formed Documents

Valid Documents

XML with Python

1) XML

(1) 만들기

(2) parsing

(3) xml 파일

2) lXML

(1) 만들기

공공데이터 open API로 데이터 가져오기

Related Posts

정규표현식 (Regular Expression) 14 Jul 2019

반정형 데이터 - JSON 10 Jul 2019

정형 데이터 - ORM, 정규식 09 Jul 2019

Comments