Hacker News
Access to HN posts
3 points by bashgrep 4838 days ago
I am trying to use EC2 machines to scrape HN posts, but it isn't working. Specifically, it seems like HN is rate limiting requests from EC2 machines. Does HN not want people scraping HN? How can I get access to all HN posts?
2 comments

You can scrape HN as long as you respect the robots.txt[1] and don't retrieve more than a couple of pages per minute.

Have you considered just pulling the data from HNSearch's API[2] or the one by iHackerNews[3]?

[1] https://news.ycombinator.com/robots.txt

[2] https://www.hnsearch.com/api

[3] http://api.ihackernews.com/
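
For what it's worth, the advice above (check robots.txt, keep to a couple of pages per minute) could be sketched roughly like this in Python. This is only a sketch under assumptions: the 30-second interval is my reading of "a couple of pages per minute", `MinIntervalLimiter` and `allowed_by_robots` are hypothetical helper names, and the robots.txt text you pass in is whatever you fetched from [1], not something reproduced here.

```python
import time
import urllib.robotparser


class MinIntervalLimiter:
    """Allow at most one request every `min_interval` seconds."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the limiter can be tested
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Block until the next request is allowed; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In a scraper loop you would call `allowed_by_robots(...)` on each URL, skip anything disallowed, and call `limiter.wait()` (with something like `MinIntervalLimiter(30)`) right before each request. If the robots.txt specifies a crawl delay, it seems safer to use whichever interval is longer.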

Does HN not want people scraping HN?

I don't remember the details, but I think pg has expressed some desire not to have people scraping the site, or at least not scraping it often. I believe the justification was that too many bots crawling/scraping the site hurts performance for everybody else. You might try searching the old posts for more info on the topic.

How can I get access to all HN posts?

Try http://api.ihackernews.com/