by hafabnew
4680 days ago
Agreed that lxml is certainly quite large and batteries-included, but I've always found I can do pretty much everything I want with just a small part of it, namely the etree module and its xpath() method. E.g. for extracting URLs from a sitemap:

    from lxml import etree

    # data holds the raw sitemap XML, fetched elsewhere
    root = etree.XML(data)
    urls = root.xpath(
        './/sitemap:loc/text()',
        namespaces={
            'sitemap': 'http://www.sitemaps.org/schemas/sitemap/0.9',
        }
    )

For RSS reading, here's a straightforward example. Obviously you can factor out the `[0]` index access on each item, and the repeated `/text()` too.

    # Python 2 (urllib2, print statement)
    import urllib2
    from lxml import etree

    data = urllib2.urlopen('https://news.ycombinator.com/rss').read()
    root = etree.XML(data)
    for i, item in enumerate(root.xpath('.//item')):
        print i, item.xpath('title/text()')[0]
        print item.xpath('description/text()')[0]
        print item.xpath('link/text()')[0]
        print

All pretty simple, though xpath selectors can get a bit gnarly at times. The tradeoff is that you can be very expressive with them.
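
One way the "factor out the index access and `/text()`" idea might look, as a rough Python 3 sketch rather than anything definitive. It swaps the per-field xpath calls for `findtext()`, part of the ElementTree-style API that lxml also implements, which returns the text directly (or None when the element is missing). The names `parse_items`, `FIELDS` and `SAMPLE` are made up for illustration, the feed is an inline stand-in rather than a real download, and the fallback import is an assumption so the snippet still runs where lxml isn't installed.

```python
try:
    from lxml import etree  # the library discussed above
except ImportError:
    # Assumption: fall back to the stdlib module, which offers the same
    # fromstring()/iter()/findtext() calls used below.
    import xml.etree.ElementTree as etree

# Hypothetical list of the child elements we want from each <item>.
FIELDS = ('title', 'description', 'link')


def parse_items(data):
    """Return one dict per <item>, with None for any missing field."""
    root = etree.fromstring(data)
    return [
        {field: item.findtext(field) for field in FIELDS}
        for item in root.iter('item')
    ]


# Inline stand-in for a downloaded feed (bytes, as a network read would give).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>First</title><link>http://example.com/1</link><description>One</description></item>
  <item><title>Second</title><link>http://example.com/2</link></item>
</channel></rss>"""

for i, entry in enumerate(parse_items(SAMPLE)):
    print(i, entry['title'], entry['link'], entry['description'])
```

A side effect of `findtext()` is that an item without a description gives None instead of the IndexError that `xpath(...)[0]` would raise, which may or may not be what you want.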