| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by fizx 3710 days ago

I wrote a language that's basically a superset of this (https://github.com/fizx/parsley/wiki) back in 2008 and used it to crawl a variety of insane job posting sites.

As crawling complexity increases, pretty soon you want an actual programming language to specify things like crawl order and cache behavior. Multi-page behavior was very hard to describe declaratively for misbehaving sites.

Also, it's a terrible default (for security reasons) to let the web pages you're parsing automagically initiate new requests to arbitrary urls.

Such as it is, I believe that the following works in some version of parsley, though I doubt its an official release.

    {
      "articles": [ {
         "title": ".title a",
         "comment_link": "follow(.subtext a:nth-child(3) @href) .athing:nth-child(1) .default"
      } ]
    }

At some point, these json things might as well be as readable as regex :/

2 comments

thomasahle 3709 days ago

> Also, it's a terrible default (for security reasons) to let the web pages you're parsing automagically initiate new requests to arbitrary urls.

Right. We'd have to only grab the article-id, validate that it is in fact an interger in the right range, and only then piece the url back together and request it.

On the other hand, maybe just checking that we stay within the domain is enough. If the website wants to screw with us, they can send us any reply they want to any url anyway.

link

jerf 3709 days ago

"At some point, these json things might as well be as readable as regex :/"

Don't feel :/ . The complexity is essential, and located in the remote website, not your code or your ideas. You still win isolating all the nasty stuff to one and precisely one location. :/ is on them, not you!

link