|
|
|
|
|
by fizx
3710 days ago
|
|
I wrote a language that's basically a superset of this (https://github.com/fizx/parsley/wiki) back in 2008 and used it to crawl a variety of insane job posting sites. As crawling complexity increases, pretty soon you want an actual programming language to specify things like crawl order and cache behavior. Multi-page behavior was very hard to describe declaratively for misbehaving sites. Also, it's a terrible default (for security reasons) to let the web pages you're parsing automagically initiate new requests to arbitrary urls. Such as it is, I believe that the following works in some version of parsley, though I doubt its an official release. {
"articles": [ {
"title": ".title a",
"comment_link": "follow(.subtext a:nth-child(3) @href) .athing:nth-child(1) .default"
} ]
}
At some point, these json things might as well be as readable as regex :/ |
|
Right. We'd have to only grab the article-id, validate that it is in fact an interger in the right range, and only then piece the url back together and request it.
On the other hand, maybe just checking that we stay within the domain is enough. If the website wants to screw with us, they can send us any reply they want to any url anyway.