HN Mirror

	by bluebird 6022 days ago
	Their service would take off much more if they offered a Python, Ruby, or JavaScript API.

1 comments

buro9 6022 days ago

No.

The service would take off much more if, instead of defining search patterns as regular expressions, they were defined as jQuery-style expressions that acknowledge the DOM and allow you to find all <title> tags that exist in the <head>. Yes, you can do this with regexps, but parsing HTML shouldn't be a regexp task.
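
To make the DOM-aware idea concrete, here is a minimal sketch in Python using only the standard library's `html.parser`. It is not the service's API or jQuery itself; the class `HeadTitleFinder`, the helper `head_titles`, and the sample markup are all hypothetical names made up for illustration. It tracks the stack of open elements, so a <title> only counts when it sits inside the document head, which is awkward to express with a regex alone.

```python
from html.parser import HTMLParser


class HeadTitleFinder(HTMLParser):
    """Collect the text of every <title> element nested inside <head>."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, outermost first
        self.titles = []   # text of each matching <title>

    def _in_head_title(self):
        # True when the innermost open tag is <title> with a <head> ancestor.
        return bool(self.stack) and self.stack[-1] == "title" and "head" in self.stack[:-1]

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if self._in_head_title():
            self.titles.append("")

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags like <meta>.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self._in_head_title():
            self.titles[-1] += data


def head_titles(html):
    """Return the stripped text of each <title> found under <head>."""
    finder = HeadTitleFinder()
    finder.feed(html)
    finder.close()
    return [t.strip() for t in finder.titles]


# Hypothetical sample page: one <title> in the head, one inside an SVG in the body.
SAMPLE = (
    '<html><head><meta charset="utf-8"><title>Example page</title></head>'
    "<body><svg><title>Icon label</title></svg><p>Hello</p></body></html>"
)

print(head_titles(SAMPLE))
```

The SVG <title> in the body is skipped because <head> is no longer on the stack when it opens. A selector engine would express the same query as something like `head title`.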

Oh, I'd like to see email gateways too... point a stream of emails at it and parse those. I'm thinking of scenarios like tripit.com taking in tons of different emails and parsing them to extract travel info.
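
As a rough sketch of what such an email gateway might do with one message, the Python below parses a raw email with the standard library's `email` package and pulls a few labeled fields out of the body. The message text, the field labels, and the function `extract_travel_info` are invented for illustration; real confirmation emails vary widely by sender and would need per-sender rules.

```python
import re
from email import message_from_string
from email.policy import default

# Hypothetical confirmation email; real ones differ for every sender.
RAW = """\
From: bookings@airline.example
To: traveler@example.com
Subject: Your itinerary

Confirmation code: ABC123
Flight: XY 456
Departs: SFO 2009-06-01 08:15
Arrives: JFK 2009-06-01 16:40
"""

# Assumed "Label: value" layout, one field per line.
FIELD = re.compile(
    r"^(Confirmation code|Flight|Departs|Arrives):\s*(.+)$", re.MULTILINE
)


def extract_travel_info(raw_message):
    """Parse one raw email and return its headers plus any labeled fields."""
    msg = message_from_string(raw_message, policy=default)
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    info = {"sender": str(msg["From"]), "subject": str(msg["Subject"])}
    for label, value in FIELD.findall(text):
        info[label.lower()] = value.strip()
    return info


info = extract_travel_info(RAW)
for key, value in info.items():
    print(f"{key}: {value}")
```

The `email` package handles headers and MIME structure; only the last step, picking values out of the plain-text body, falls back to a pattern match.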

icey 6022 days ago

I'm building something right now that includes page parsing, and so far I've only been building in regex support. I like your jQuery selector idea as well. Are there any other ways you can think of that would make searching the contents of a page programmatically easier for you?

qeorge 6022 days ago

May I suggest taking a look at Parsley? It's the syntax they use on www.parselets.com. The documentation for implementing it in your own apps is a little sparse, but the data format is awesome. Here's one that describes scraping HN:

http://parselets.com/parselets/yc/14

Might not be a fit for your project, but in terms of describing parsing instructions to a crawler, it's the best format I've ever seen.

icey 6022 days ago

I'm not crawling, but that looks pretty interesting. I'll bookmark it and take a look later for sure. Thanks!

asher 6022 days ago

Hpricot for Ruby is great.

For instance, parsing a Google search results page:

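        # Assumes doc is an already-parsed Hpricot document of the results page.
        (doc/"a.l").each do |link|
            label = link.inner_text          # visible link text
            href = link.attributes['href']   # target URL
            # ...
        end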

jdrock 6022 days ago

This is something we can include as an 80App :) Thanks for the suggestion!
