Hacker News new | ask | show | jobs
by ashwing_2005 4456 days ago
This is great. However I have one bone to pick(or rather know if its been taken care of) Scrapy uses xpaths or equivalent representations to scrape. However there are many alternate xpaths to represent the same div. For e.g. Suppose data is to be extracted from the fifth div in a sequence of divs. So it would use that as the xpath. But now say it also has a meaningful class or id attribute. An xpath based on this attribute might be a better choice because this content may not be in the fifth div across all the pages in a site I want to scrape. Is this taken care of by taking the common denominator from many sample pages?
1 comments

Portia uses https://github.com/scrapy/scrapely library for data extraction. It doesn't use XPaths for learning. There are some links to papers in scrapely README; scrapely is largely based on ideas from these papers, but there are many improvements. In short - yes, this is taken care of.