공부하자

반정형 데이터 - XML

|

빅데이터 청년인재 Day 8


목차


XML


XML이란

  • Markup(열고 닫는) language
  • 데이터를 저장하고 전송 할 수 있다.
  • human and machine readble


왜 XML을 사용하는가

  • 어디서나 공용으로 사용할 수 있는 structing data
  • 소프트웨어와 하드웨어로부터 독립적
  • 잘 짜여진 구조


특징

  • 태그가 미리 정의되어 있지 않음
    • 따라서 사용자들끼리 명시해야한다
  • 텍스트 기반
    • 더 읽기 쉽고 문서 작성도 쉽고 디버깅도 쉽다
  • 확장 가능성
    • 객첸체나 계층, 관계 같은 풍부한 구조를 지원한다
    • 대부분의 XML 어플리케이션은 새로운 데이터가 추가되거나 삭제되도 예상대로 작동한다.
  • validity


Well-formed Documents

  • 올바른 syntax로 쓰여진 XML 문서를 Well Formed 라 한다.
  • 루트 노드를 반드시 가지고 있다.
  • 여는 태그와 닫는 태그를 반드시 가지고 있다.
  • 대소문자를 구분한다.
  • 요소들은 반드시 올바르게 중첩되어 있어야한다.
  • 어트리뷰트들은 반드시 ““에 쌓여 있어야 한다.


Valid Documents

  • Well Formed XML 문서는 valid XML 문서와 같지 않다.
    • DTD - The original Document Type Definition
    • XML Schema - An XML-based alternative to DTD


XML with Python


  • 공식 문서 : 여기
  • 라이브러리
    • xml : import xml.etree.ElementTree as et
    • lxml : from lxml import etree

1) XML

(1) 만들기

# bookstore부터 시작하는 DOM
bookStore = et.Element("bookstore")

book1 = et.Element("book", category="cooking")
bookStore.insert(0, book1)

title1 = et.Element("title")
title1.attrib["lang"] = "en"
title1.text = "Everyday Italian"
book1.append(title1)

# 노드에서 sub element 생성
et.SubElement(book1, "author").text = "Giada De Laurentiis"
et.SubElement(book1, "year").text = "2005"
et.SubElement(book1, "price").text = "30.00"

book2 = et.Element("book", {"category":"children"})
bookStore.append(book2)

title2 = et.Element("title")
title2.attrib["lang"] = title1.get("lang")
title2.text = "Harry Potter"
book2.append(title2)

et.SubElement(book2, "author").text = "Giada De Laurentiis"
et.SubElement(book2, "year").text = "2005"
et.SubElement(book2, "price").text = "30.00"

et.dump(bookStore)
# output 

<bookstore><book category="cooking"><title lang="en">Everyday Italian</title><author>Giada De Laurentiis</author><year>2005</year><price>30.00</price></book><book category="children"><title lang="en">Harry Potter</title><author>Giada De Laurentiis</author><year>2005</year><price>30.00</price></book></bookstore>

(2) parsing

root = et.fromstring(et.tostring(bookStore))

#self.children, list(elem)
childNodes = root.getchildren()
print(len(childNodes))
for childNode in childNodes:
    print(childNode.tag, childNode.items())
    
for childNode in childNodes[0]:
    print(childNode.tag, childNode.keys())
    if childNode.keys() != []:
        print([childNode.get(k) for k in childNode.keys()])
        
        
book = root.find("book")
print(book.tag, book.get("category"))

bookList = root.findall("book")
for book in bookList:
    print(book.tag, book.get("category"))
    
# .은 현재위치 - 여기서는 루트이다.
# 즉 루트 아래 자식요소 중에 타이틀인거 찾아라
title = root.find(".//title")
print(type(title), title.text)

titleList = root.findall(".//title")
print([title.text for title in titleList])

title = root.findtext(".//title")
print(type(title), title)

book = root.find(".//book[@category='children']")
print(book, book.tag)
# output 

2
book [('category', 'cooking')]
book [('category', 'children')]
title ['lang']
['en']
author []
year []
price []
book cooking
book cooking
book children
<class 'xml.etree.ElementTree.Element'> Everyday Italian
['Everyday Italian', 'Harry Potter']
<class 'str'> Everyday Italian
<Element 'book' at 0x10dd81bd8> book

(3) xml 파일

  • write
from xml.etree.ElementTree import ElementTree

tree = ElementTree(root)
tree.write("book_xml.xml", encoding="utf-8", xml_declaration="utf-8")
  • parse
from xml.etree.ElementTree import parse

tree = parse("book_xml.xml")
root = tree.getroot()

for node in root.iter():
    print(node.tag, node.text)
# output

bookstore None
book None
title Everyday Italian
author Giada De Laurentiis
year 2005
price 30.00
book None
title Harry Potter
author Giada De Laurentiis
year 2005
price 30.00
  • ElementTree
tree = ElementTree(file="book_xml.xml")
root = tree.getroot()

for node in root.iter():
    print(node.tag, node.text)
bookstore None
book None
title Everyday Italian
author Giada De Laurentiis
year 2005
price 30.00
book None
title Harry Potter
author Giada De Laurentiis
year 2005
price 30.00

2) lXML

(1) 만들기

from lxml import etree

bookStore = etree.Element("bookstore")

book1 = etree.SubElement(bookStore, "book")
book2 = etree.SubElement(bookStore, "book", attrib={"category":"children"})

book1.attrib["category"] = "cooking"

title1 = etree.Element("title", lang="en")
title1.text = "Everyday Italian"
book1.append(title1)

# 노드에서 sub element 생성
etree.SubElement(book1, "author").text = "Giada De Laurentiis"
etree.SubElement(book1, "year").text = "2005"
etree.SubElement(book1, "price").text = "30.00"

title2 = etree.Element("title")
title2.set("lang", title1.get("lang"))
title2.text = "Harry Potter"
book2.append(title2)

etree.SubElement(book2, "author").text = "Giada De Laurentiis"
etree.SubElement(book2, "year").text = "2005"
book2.insert(3, etree.Element("price"))

print(len(book2))
book2[-1].text = "30.00"

xmlBytes = etree.tostring(bookStore, encoding="utf-8", pretty_print=True, xml_declaration=True)
xmlStr = etree.tounicode(bookStore, pretty_print=True)
print(type(xmlBytes), type(xmlStr))
etree.dump(bookStore)
# output 

<class 'bytes'> <class 'str'>
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
</bookstore>
  • ElementTree
xml = etree.XML(etree.tostring(bookStore))
xmlTree = etree.ElementTree(xml)
xmlRoot = xmlTree.getroot()

print(xmlTree.docinfo.xml_version)
print(xmlTree.docinfo.encoding)
print(xmlTree.docinfo.doctype)
print(xmlTree.docinfo.root_name)

print(len(xmlRoot))
for childNode in xmlRoot:
    print(childNode.tag, childNode.attrib)
# output

1.0
UTF-8

bookstore
2
book {'category': 'cooking'}
book {'category': 'children'}
xml = etree.fromstring(etree.tostring(bookStore))
xmlTree = etree.ElementTree(xml)
xmlRoot = xmlTree.getroot()

childNodes = xmlRoot.getchildren()

print(len(childNodes))
for childNode in childNodes:
    print(childNode.tag, childNode.items())
    
for childNode in childNodes[0]:
    print(childNode.tag, childNode.keys())
    if childNode.keys() != []:
        print([childNode.get(k) for k in childNode.keys()])
        
book = xmlRoot.find("book")
print(book.tag, book.get("category"))

bookList = xmlRoot.findall("book")
for book in bookList:
    print(book.tag, book.get("category"))
    
title = xmlRoot.find(".//title")
print(type(title), title.text)

titleList = xmlRoot.findall(".//title")
print([title.text for title in titleList])

title = xmlRoot.findtext(".//title")
print(type(title), title)

book = xmlRoot.find(".//book[@category='children']")
print(book, book.tag)

for cilldNode in xmlRoot.iter():
    print(childNode.tag, childNode.text)
    
for childNode in xmlRoot.iter("book"):
    print(childNode.tag, childNode.text)
# output

2
book [('category', 'cooking')]
book [('category', 'children')]
title ['lang']
['en']
author []
year []
price []
book cooking
book cooking
book children
<class 'lxml.etree._Element'> Everyday Italian
['Everyday Italian', 'Harry Potter']
<class 'str'> Everyday Italian
<Element book at 0x10dd23b08> book
price 30.00
price 30.00
price 30.00
price 30.00
price 30.00
price 30.00
price 30.00
price 30.00
price 30.00
price 30.00
price 30.00
book None
book None
  • write & parse
xmlTree.write("book_tree.xml")
etree.ElementTree(xmlRoot).write("book_root.xml")

xmlTree = etree.parse("book_tree.xml")
xmlRoot = xmlTree.getroot()

etree.dump(xmlRoot)

xmlTree = etree.parse("book_root.xml")
xmlRoot = xmlTree.getroot()

etree.dump(xmlRoot)
# output

<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
</bookstore>
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
</bookstore>

공공데이터 open API로 데이터 가져오기


import urllib
from lxml import etree

# 서비스 url
url = "http://openapi.airkorea.or.kr/openapi/services/rest/ArpltnInforInqireSvc/getMsrstnAcctoRltmMesureDnsty"

# 필요한 파라미터
params = {
    "serviceKey" : "개인 키",
    "numOfRows" : 10,
    "pageSize" : 10,
    "pageNo" : 1,
    "startPage" : 1,
    "stationName" : "노원구",
    "dataTerm" : "DAILY",
    "ver" : "1.3",
}

params["serviceKey"] = urllib.parse.unquote(params["serviceKey"])
params = urllib.parse.urlencode(params)
params = params.encode("utf-8")

# get 방식으로 url 날림
req = urllib.request.Request(url, data=params)
# request 객체 만들어서 보냄.
res = urllib.request.urlopen(req)

resStr = res.read()

xmlObj = etree.fromstring(resStr)
xmlRoot = etree.ElementTree(xmlObj).getroot()

pm25List = xmlRoot.findall(".//item/pm25Value")
for item in pm25List:
    print(item.tag, item.text)
# output

pm25Value 1
pm25Value 2
pm25Value 3
pm25Value 3
pm25Value 3
pm25Value 4
pm25Value 3
pm25Value 2
pm25Value -
pm25Value -

Comments