by aGHz
4476 days ago
Like the OP, I needed more control over the crawling behaviour for a project. All the scraping code quickly became a mess though, so I wrote a library that lets you declaratively define the data you're interested in (think Django forms). It also provides decorators that let you attach imperative code for cleaning up the data before and after parsing. See how easy it is to extract data from the HN front page: https://github.com/aGHz/structominer/blob/master/examples/hn... I'm still working on proper packaging, so for the moment the only way to install Struct-o-miner is to clone it from https://github.com/aGHz/structominer.
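
For anyone unfamiliar with the declarative style, here is a rough sketch of the general pattern using only the standard library. This is not Struct-o-miner's actual API: the `Field`, `Document` and `postprocess` names are made up for illustration, and it matches with regular expressions purely to stay self-contained.

```python
import re


class Field:
    """Hypothetical declarative field: a regex with one capture group."""

    def __init__(self, pattern):
        self.pattern = re.compile(pattern, re.S)
        self.cleaners = []  # cleanup functions, applied in registration order

    def postprocess(self, func):
        # Decorator that registers an imperative cleanup step for this field.
        self.cleaners.append(func)
        return func

    def extract(self, html):
        match = self.pattern.search(html)
        if match is None:
            return None  # field not present in the input
        value = match.group(1)
        for clean in self.cleaners:
            value = clean(value)
        return value


class Document:
    """Finds every Field declared on a subclass and runs it over the input."""

    def __init__(self, html):
        fields = {}
        for klass in reversed(type(self).__mro__):
            for name, attr in vars(klass).items():
                if isinstance(attr, Field):
                    fields[name] = attr
        self.data = {name: f.extract(html) for name, f in fields.items()}


class Story(Document):
    # The "what": declare the data you want.
    title = Field(r'<a class="title">(.*?)</a>')
    points = Field(r'<span class="score">(.*?)</span>')

    # The "how": imperative cleanup attached to a field.
    title.postprocess(str.strip)

    @points.postprocess
    def _points_to_int(raw):
        # Turn a string such as "128 points" into the integer 128.
        return int(raw.split()[0])


html = '<a class="title"> Show HN: Example </a><span class="score">128 points</span>'
story = Story(html)
print(story.data)
```

The point of the split is that the class body reads like a schema, while the messy per-field cleanup lives in small functions next to the field they belong to.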
I've actually built something similar to this myself, and I plan on writing an article along these lines in the future.
Yours looks pretty polished though, good job!