XPath (libxml2) in Python
Step 1: Install libxml2 and its Python bindings (the python-libxml2 package on Debian/Ubuntu) using the Synaptic package manager.
Step 2: Create an XML file that you want to traverse.
For example, I am using w3schools' XML document http://www.w3schools.com/xpath/books.xml.
You can also use a local file that exists on the file system.
Step 3: Create a Python file, for example xpathcode.py.
Open xpathcode.py, import libxml2 and urllib, and parse the XML file.
import libxml2
import urllib
rss=libxml2.parseDoc(urllib.urlopen('http://www.w3schools.com/xpath/books.xml').read())
Note: If the file exists on the local file system, parse it like this instead (urllib is not needed here):
import libxml2
rss=libxml2.parseDoc(open('books.xml', 'r').read())
Step 4: Now try the following XPath queries one by one.
a. Selects the first book element that is the child of the bookstore element
nodes=rss.xpathEval('/bookstore/book[1]')
# If this prints only an object address, try: print nodes[0].serialize()
print nodes[0]
Output:
<book category="COOKING">
  <title lang="en">Everyday Italian</title>
  <author>Giada De Laurentiis</author>
  <year>2005</year>
  <price>30.00</price>
</book>
b. Selects the last book element that is the child of the bookstore element.
nodes=rss.xpathEval('/bookstore/book[last()]')
print nodes[0]
Output:
<book category="WEB">
  <title lang="en">Learning XML</title>
  <author>Erik T. Ray</author>
  <year>2003</year>
  <price>39.95</price>
</book>
c. Selects the last but one book element that is the child of the bookstore element
nodes=rss.xpathEval('/bookstore/book[last()-1]')
print nodes[0]
Output:
<book category="WEB">
  <title lang="en">XQuery Kick Start</title>
  <author>James McGovern</author>
  <author>Per Bothner</author>
  <author>Kurt Cagle</author>
  <author>James Linn</author>
  <author>Vaidyanathan Nagarajan</author>
  <year>2003</year>
  <price>49.99</price>
</book>
d. Selects the first two book elements that are children of the bookstore element
nodes=rss.xpathEval('/bookstore/book[position()<3]')
for i in nodes:
    print i
Output:
<book category="COOKING">
  <title lang="en">Everyday Italian</title>
  <author>Giada De Laurentiis</author>
  <year>2005</year>
  <price>30.00</price>
</book>
<book category="CHILDREN">
  <title lang="en">Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
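If libxml2 is not installed, the positional queries a to d can be approximated with Python's standard-library xml.etree.ElementTree. This is only a rough sketch and not the libxml2 API: ElementTree supports a limited XPath subset, the inline XML is a trimmed copy of the four books shown above, and the helper name titles is made up for this example. ElementTree has no position() function, so query d is done with a list slice.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the four-book document used above (titles only).
XML = """<bookstore>
  <book category="COOKING"><title lang="en">Everyday Italian</title></book>
  <book category="CHILDREN"><title lang="en">Harry Potter</title></book>
  <book category="WEB"><title lang="en">XQuery Kick Start</title></book>
  <book category="WEB"><title lang="en">Learning XML</title></book>
</bookstore>"""

root = ET.fromstring(XML)  # root is the <bookstore> element


def titles(books):
    """Hypothetical helper: return the title text of each book element."""
    return [book.findtext("title") for book in books]


# Paths are relative to <bookstore>, so 'book[1]' plays the role of
# '/bookstore/book[1]' in the libxml2 queries.
first = titles(root.findall("book[1]"))
last = titles(root.findall("book[last()]"))
last_but_one = titles(root.findall("book[last()-1]"))

# No position() in ElementTree: slice the list of children instead.
first_two = titles(root.findall("book")[:2])

print(first)
print(last)
print(last_but_one)
print(first_two)
```

The positions are 1-based inside the XPath predicate but 0-based in the Python slice, which is an easy thing to mix up.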
e. Selects all the title elements that have an attribute named lang
nodes=rss.xpathEval('//title[@lang]')
for i in nodes:
    print i
Output:
<title lang="en">Everyday Italian</title>
<title lang="en">Harry Potter</title>
<title lang="en">XQuery Kick Start</title>
<title lang="en">Learning XML</title>
f. Selects all the title elements that have an attribute named lang with a value of 'eng'. The sample document only uses lang="en", so the result is empty.
nodes=rss.xpathEval("//title[@lang='eng']")
if not nodes:
    print 'eng not exist'
Output:
eng not exist
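The two attribute queries above (e and f) can also be tried without libxml2. The sketch below uses the standard-library xml.etree.ElementTree as a stand-in, on a trimmed inline copy of the sample titles; the variable names are only for illustration. Attribute predicates are part of the XPath subset that ElementTree supports, so the expressions are almost the same, except that they start with './/' instead of '//'.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document: only the title elements are kept.
XML = """<bookstore>
  <book><title lang="en">Everyday Italian</title></book>
  <book><title lang="en">Harry Potter</title></book>
  <book><title lang="en">XQuery Kick Start</title></book>
  <book><title lang="en">Learning XML</title></book>
</bookstore>"""

root = ET.fromstring(XML)

# e. every title that carries a lang attribute, whatever its value
with_lang = root.findall(".//title[@lang]")

# f. every title whose lang attribute is exactly 'eng' (none in this data)
lang_eng = root.findall(".//title[@lang='eng']")

# For comparison: the value that is really used in the document
lang_en = root.findall(".//title[@lang='en']")

print(len(with_lang))
if not lang_eng:
    print("eng not exist")
print([t.text for t in lang_en])
```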
g. Selects all the title elements of the book elements of the bookstore element that have a price element with a value greater than 35.00
nodes=rss.xpathEval("/bookstore/book[price>35.00]/title")
for i in nodes:
    print i
Output:
<title lang="en">XQuery Kick Start</title>
<title lang="en">Learning XML</title>
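To see what the [price>35.00] predicate in query g is doing, here is the same filter written out by hand in plain Python. It is a sketch using the standard-library xml.etree.ElementTree rather than libxml2 (ElementTree cannot evaluate numeric comparisons inside a path), and the inline XML is a trimmed copy of the sample data with only title and price kept.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document: title and price only.
XML = """<bookstore>
  <book><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="en">XQuery Kick Start</title><price>49.99</price></book>
  <book><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>"""

root = ET.fromstring(XML)

# XPath converts the text of <price> to a number before comparing;
# here that conversion is done explicitly with float().
expensive = [
    book.findtext("title")
    for book in root.findall("book")
    if float(book.findtext("price")) > 35.00
]

print(expensive)
```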
h. Selects all the title AND price elements of all book elements
nodes=rss.xpathEval("//book/title | //book/price")
for i in nodes:
    print i
Output:
<title lang="en">Everyday Italian</title>
<price>30.00</price>
<title lang="en">Harry Potter</title>
<price>29.99</price>
<title lang="en">XQuery Kick Start</title>
<price>49.99</price>
<title lang="en">Learning XML</title>
<price>39.95</price>
i. Selects all the title elements of the book element of the bookstore element AND all the price elements in the document
nodes=rss.xpathEval("/bookstore/book/title | //price")
for i in nodes:
    print i
Output:
<title lang="en">Everyday Italian</title>
<price>30.00</price>
<title lang="en">Harry Potter</title>
<price>29.99</price>
<title lang="en">XQuery Kick Start</title>
<price>49.99</price>
<title lang="en">Learning XML</title>
<price>39.95</price>
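Queries h and i print the same thing because, in this document, every title and every price sits directly under a book, and the | operator returns the combined nodes in document order. The sketch below imitates that union with the standard-library xml.etree.ElementTree, which has no | operator of its own. The inline XML is a trimmed copy of the sample data and the names wanted and union are just for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document: title and price only.
XML = """<bookstore>
  <book><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="en">XQuery Kick Start</title><price>49.99</price></book>
  <book><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>"""

root = ET.fromstring(XML)

wanted = {"title", "price"}

# Walk the books in document order and keep the children whose tag is in
# the wanted set, which mimics '//book/title | //book/price'.
union = [
    (child.tag, child.text)
    for book in root.iter("book")
    for child in book
    if child.tag in wanted
]

for tag, text in union:
    print(tag + ": " + text)
```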
j. Selects the text of all the title elements
nodes=rss.xpathEval("/bookstore/book/title/text()")
for i in nodes:
    print i
Output:
Everyday Italian
Harry Potter
XQuery Kick Start
Learning XML
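One difference worth knowing if you try query j with the standard-library xml.etree.ElementTree instead of libxml2: ElementTree has no text() step, because text is stored as the .text attribute of each element rather than as a separate node. A small sketch on a trimmed inline copy of the sample titles (the variable names are only illustrative):

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document: only the title elements are kept.
XML = """<bookstore>
  <book><title lang="en">Everyday Italian</title></book>
  <book><title lang="en">Harry Potter</title></book>
  <book><title lang="en">XQuery Kick Start</title></book>
  <book><title lang="en">Learning XML</title></book>
</bookstore>"""

root = ET.fromstring(XML)

# Select the elements first, then read their text from the .text attribute.
texts = [title.text for title in root.findall("book/title")]

print("\n".join(texts))
```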
For more detail on XPath, please visit: http://www.w3schools.com/xpath/default.asp
Hey it works fine but I only get the node address as output and not the complete output as yours :/