Ask HN: HN data access
5 points by bashgrep 4850 days ago
How can I get access to all of the submissions on news.ycombinator.com? I don't want the comments, just the posts. It seems like after every HNS outage, settings get added that make it more difficult to access the content on HNS. For example, it seems like you now have to sign in to access older posts. Also, the thrift database has only about 4 million records, but the IDs on HNS are in the 5 millions.
3 comments

Seems like HNS is rate-limiting connections from Amazon EC2 machines.

HNS/pg, do you not want us scraping HNS? What is the best way to get all the posts?

I think that there are some mirrors floating around. But I wouldn't know where those mirrors are, or what kind of access they allow.
Where did you read about them?
On the original millionshort thread, where somebody pointed out the number of HN mirrors and linked a few.
These all lead to dead ends: https://news.ycombinator.com/item?id=1721105 https://news.ycombinator.com/item?id=1881262

Do you have a link for the post you are talking about?

EDIT: In this thread pg says wait 30 seconds between each request (https://news.ycombinator.com/item?id=1702488), but that doesn't work either.
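
For anyone who wants to try the 30-second spacing anyway, here is a rough sketch of what that pacing could look like. It only illustrates the waiting logic; nothing here is confirmed to be accepted by the site, and as noted above it did not work for me. The names `fetch_item` and `crawl` are made up for this example, and the `item?id=` URL pattern is simply the one visible in the links in this thread.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple
from urllib.request import urlopen

# URL pattern taken from the item links posted in this thread.
ITEM_URL = "https://news.ycombinator.com/item?id={}"


def fetch_item(item_id: int) -> bytes:
    """Download the raw HTML for one item (hypothetical helper)."""
    with urlopen(ITEM_URL.format(item_id), timeout=30) as resp:
        return resp.read()


def crawl(
    item_ids: Iterable[int],
    delay: float = 30.0,  # assumption: the 30-second gap suggested by pg
    fetch: Callable[[int], bytes] = fetch_item,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[int, bytes]]:
    """Yield (id, body) pairs, sleeping `delay` seconds between requests."""
    first = True
    for item_id in item_ids:
        if not first:
            sleep(delay)  # wait before every request except the first
        first = False
        yield item_id, fetch(item_id)
```

Passing `fetch` and `sleep` as parameters is just a convenience so the loop can be dry-run with stand-ins before pointing it at the real site.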

EDIT: "unimpressive", was this what you were referencing? https://docs.google.com/spreadsheet/ccc?key=0AqL8kR005z0QdEN...

https://news.ycombinator.com/item?id=3911687

It appears to be dead.

EDIT: Another one that appears to be among the living.

http://hackerbra.in/

I am not sure, but I think the IDs count posts and comments. So I think that means about 4 million submissions and over a million comments. Not all submissions draw comments.

Can anyone verify this?

Look at the URLs for a submission (found via its "discuss" link on the post listing page) and for a non-post comment. They are identical in format; only the ID value varies.

For this to work, they must share the same set of ID values.
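
To make that concrete, here is a small sketch that pulls the ID out of an item URL. It assumes the `/item?id=N` form seen in the links earlier in this thread; `item_id` is a name invented for the example. The same function handles a submission URL and a comment URL, since nothing in the URL itself tells them apart.

```python
from urllib.parse import parse_qs, urlparse


def item_id(url: str) -> int:
    """Return the numeric ID from an item URL of the form .../item?id=N."""
    parsed = urlparse(url)
    if parsed.path != "/item":
        raise ValueError(f"not an item URL: {url}")
    # parse_qs maps each query key to a list of values; take the first "id".
    return int(parse_qs(parsed.query)["id"][0])


# Two item links from this thread; the URL format gives no hint whether
# an ID belongs to a submission or a comment.
for url in (
    "https://news.ycombinator.com/item?id=1721105",
    "https://news.ycombinator.com/item?id=3911687",
):
    print(item_id(url))
```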