

by hydragit 3139 days ago

WebOOB [0] is a good Python framework for scraping websites. It's mostly used to aggregate data from multiple websites by having each site's backend implement an abstract interface (for example, the CapBank abstract interface for parsing banking sites), but it can be used without that part.
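To make the aggregation idea concrete, here is a rough sketch in plain Python. It does not use WebOOB itself: `BankCapability`, `iter_accounts`, the two backend classes and `total_balance` are hypothetical names made up for illustration, and the balances are hard-coded where a real backend would scrape its site.

```python
from abc import ABC, abstractmethod
from decimal import Decimal


class BankCapability(ABC):
    """Hypothetical stand-in for a capability interface such as CapBank."""

    @abstractmethod
    def iter_accounts(self):
        """Yield (label, balance) pairs for every account on the site."""


class FirstBankBackend(BankCapability):
    def iter_accounts(self):
        # A real backend would scrape its site here; this value is made up.
        yield ("Checking", Decimal("120.50"))


class SecondBankBackend(BankCapability):
    def iter_accounts(self):
        # Made-up value again, standing in for a second, unrelated site.
        yield ("Savings", Decimal("3000.00"))


def total_balance(backends):
    """Aggregate across sites without knowing how each one is scraped."""
    return sum(
        (balance for backend in backends for _, balance in backend.iter_accounts()),
        Decimal("0"),
    )
```

The calling code only depends on the interface, so adding a third site means writing one more backend class and nothing else.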

On the pure scraping side, it offers "declarative parsing" to avoid painful plain-old procedural code [1]. You parse a page by specifying a set of XPaths and the library filters to apply to the matched elements: for example CleanText to normalize whitespace, Lower to lower-case, Regexp, CleanDecimal to parse a number, and many more. URL patterns can be associated with a Page class that holds such declarative parsing. If the declarative form becomes too verbose, it can always be replaced locally with a plain-old Python method.
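A toy illustration of what "declarative" means here, using only the standard library. This is not WebOOB's API: the filter names are borrowed from the ones mentioned above, but their implementations, the `XPath` helper, `FIELDS`, `parse_rows`, `is_debit` and the sample table are all assumptions for the sake of the example. `xml.etree.ElementTree` only supports a small XPath subset, which is enough for this sketch.

```python
import re
import xml.etree.ElementTree as ET
from decimal import Decimal


class XPath:
    """Return the text of the first element matching an ElementTree path."""
    def __init__(self, path):
        self.path = path

    def __call__(self, element):
        found = element.find(self.path)
        return "" if found is None else "".join(found.itertext())


class CleanText:
    """Collapse runs of whitespace and strip both ends."""
    def __init__(self, inner):
        self.inner = inner

    def __call__(self, element):
        return " ".join(self.inner(element).split())


class Lower:
    """Lower-case the result of the wrapped filter."""
    def __init__(self, inner):
        self.inner = inner

    def __call__(self, element):
        return self.inner(element).lower()


class Regexp:
    """Return the first capture group of the pattern."""
    def __init__(self, inner, pattern):
        self.inner = inner
        self.pattern = re.compile(pattern)

    def __call__(self, element):
        match = self.pattern.search(self.inner(element))
        if match is None:
            raise ValueError("pattern not found")
        return match.group(1)


class CleanDecimal:
    """Drop everything except digits, dot and minus, then parse a Decimal."""
    def __init__(self, inner):
        self.inner = inner

    def __call__(self, element):
        return Decimal(re.sub(r"[^\d.\-]", "", self.inner(element)))


def is_debit(row):
    # Escape hatch: a field can be an ordinary function instead of a filter chain.
    return CleanDecimal(XPath("./td[3]"))(row) < 0


# The "declaration": one filter chain (or plain callable) per output field.
FIELDS = {
    "label": CleanText(XPath("./td[1]")),
    "category": Lower(CleanText(XPath("./td[2]"))),
    "amount": CleanDecimal(XPath("./td[3]")),
    "ref": Regexp(CleanText(XPath("./td[1]")), r"#(\d+)"),
    "is_debit": is_debit,
}

SAMPLE = """
<table>
  <tr><td>  Coffee   shop #123 </td><td>FOOD</td><td>-4.50 EUR</td></tr>
  <tr><td>Salary #7</td><td>Income</td><td>1,250.00 EUR</td></tr>
</table>
"""


def parse_rows(markup, row_path, fields):
    """Apply every field's filter chain to each row matched by row_path."""
    root = ET.fromstring(markup.strip())
    return [{name: f(row) for name, f in fields.items()} for row in root.findall(row_path)]
```

Calling `parse_rows(SAMPLE, "./tr", FIELDS)` returns one dict per table row. The point is that `FIELDS` reads like a description of the page rather than a loop full of string handling, and `is_debit` shows how a single field can fall back to a regular function.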

A set of applications is provided to visualize the extracted data, along with other niceties that make debugging easier. Simply put: « Wonderful, Efficient, Beautiful, Outshining, Omnipotent, Brilliant: meet WebOOB ».

[0] http://weboob.org/

[1] http://dev.weboob.org/guides/module.html#parsing-of-pages