by bluebird 6022 days ago
Their service would take off much more if they offered a Python, Ruby, or JavaScript API.

No.

The service would take off much more if, instead of defining search patterns as regular expressions, they were defined as jQuery-style expressions that understand the DOM and let you find all <title> tags that exist in the <head>. Yes, you can do this with regexps, but parsing HTML shouldn't be a regexp task.
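
To make the difference concrete, here is a rough sketch, not anything the service offers. It uses Ruby's bundled REXML with an XPath query as a stand-in for a jQuery-style selector; the sample markup and both helper names (`titles_in_head`, `titles_by_regex`) are made up for illustration, and REXML only accepts well-formed, XHTML-like input.

```ruby
require 'rexml/document'

# Hypothetical sample page: one <title> inside <head>, one stray <title> in <body>.
SAMPLE_PAGE = <<~HTML
  <html>
    <head><title>Real title</title></head>
    <body><svg><title>Icon tooltip</title></svg></body>
  </html>
HTML

# Hypothetical helper: text of every <title> that is a direct child of <head>.
# The query knows about document structure, so the <title> in <body> is skipped.
def titles_in_head(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, '//head/title').map(&:text)
end

# Structure-blind regex for comparison: it picks up every <title>, wherever it sits.
def titles_by_regex(markup)
  markup.scan(%r{<title>(.*?)</title>}m).flatten
end
```

Calling `titles_in_head(SAMPLE_PAGE)` gives back only the title from the head, while `titles_by_regex(SAMPLE_PAGE)` also returns the one buried in the body, which is the kind of mistake a DOM-aware pattern avoids.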

Oh, I'd like to see email gateways too... point a stream of emails at it and parse those. I'm thinking of scenarios like tripit.com taking in tons of different emails and parsing them to extract travel info.

I'm building something right now that includes page parsing, and so far I've only built in regex support. I like your jQuery selector idea as well. Are there any other ways you can think of that would make searching the contents of a page programmatically easier for you?
May I suggest taking a look at Parsley? It's the syntax they use on www.parselets.com. The documentation for implementing it in your own apps is a little sparse, but the data format is awesome. Here's one that describes scraping HN:

http://parselets.com/parselets/yc/14

Might not be a fit for your project, but in terms of describing parsing instructions to a crawler, it's the best format I've ever seen.

I'm not crawling, but that looks pretty interesting. I'll bookmark it and take a look later for sure - thanks!
Hpricot for Ruby is great.

For instance, parsing a Google search results page:

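        # doc is assumed to be an Hpricot document of the results page
        (doc/"a.l").each do |link|
            label = link.inner_text          # visible link text
            href = link.attributes['href']   # link target
            ...
        end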
This is something we can include as an 80App :) Thanks for the suggestion!