by icey 6022 days ago
I'm building something right now that includes page parsing, and so far I've only been building in regex support. I like your jQuery selector idea as well. Are there any other ways you can think of that would make searching the contents of a page programmatically easier for you?
2 comments

May I suggest taking a look at Parsely? It's the syntax they use on www.parselets.com. The documentation for implementing it in your own apps is a little sparse, but the data format is awesome. Here's one that describes scraping HN:

http://parselets.com/parselets/yc/14

Might not be a fit for your project, but in terms of describing parsing instructions to a crawler, it's the best format I've ever seen.

I'm not crawling, but that looks pretty interesting. I'll bookmark it and take a look later for sure - thanks!

Hpricot for Ruby is great.

For instance, parsing a Google search results page:

        (doc/"a.l").each do |link|
            label = link.inner_text
            href = link.attributes['href']
            ...
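
If you don't have Hpricot installed, here is a rough, self-contained sketch of the same loop. Note that it uses REXML rather than Hpricot, so it only works on well-formed markup, and the XPath test @class='l' matches the exact attribute value instead of behaving like the CSS selector a.l. SAMPLE_HTML and extract_links are made-up names for illustration, and the sample markup is a stand-in, not real Google output.

```ruby
require 'rexml/document'

# Hypothetical, well-formed stand-in for a results page (not real Google markup).
SAMPLE_HTML = <<~HTML
  <html>
    <body>
      <a class="l" href="http://example.com/one">First result</a>
      <a class="other" href="http://example.com/skip">Not a result</a>
      <a class="l" href="http://example.com/two">Second result</a>
    </body>
  </html>
HTML

# Returns [label, href] pairs for every <a> whose class attribute is exactly "l".
# Assumes the input parses as XML; real-world HTML often will not.
def extract_links(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, "//a[@class='l']").map do |link|
    [link.text, link.attributes['href']]
  end
end

extract_links(SAMPLE_HTML).each do |label, href|
  puts "#{label} -> #{href}"
end
```

For messy real pages a forgiving HTML parser is probably still the better choice; this is only meant to show the shape of the select-then-extract loop.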