Hacker News
Any good api to scrape HN other than this? (github.com)
34 points by kaushikfrnd 4565 days ago
How can I scrape HN other than with https://github.com/karan/HackerNewsAPI? Is there a good premade library in Python?
19 comments

The robots.txt from news.ycombinator.com reads as follows:

    User-Agent: * 
    Disallow: /x?
    Disallow: /vote?
    Disallow: /reply?
    Disallow: /submitted?
    Disallow: /submitlink?
    Disallow: /threads?
    Crawl-delay: 30
So nominally you should feel free to set up a scraper that crawls one non-disallowed resource every 30 seconds.
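
A minimal sketch of checking those rules in Python with the standard library's urllib.robotparser. The robots.txt text is pasted from the listing above rather than fetched, and the helper names and the example URLs (including the `example` user id) are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# robots.txt as quoted above, pasted in so the sketch needs no network access.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /submitlink?
Disallow: /threads?
Crawl-delay: 30
"""

BASE = "https://news.ycombinator.com"


def load_rules(text=ROBOTS_TXT):
    """Parse robots.txt text into a RobotFileParser (hypothetical helper)."""
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    return rp


def allowed_urls(rp, urls, agent="*"):
    """Keep only the URLs that the given user agent may fetch."""
    return [u for u in urls if rp.can_fetch(agent, u)]


rules = load_rules()
candidates = [
    BASE + "/news",
    BASE + "/submitted?id=example",  # hypothetical user id
    BASE + "/threads?id=example",    # hypothetical user id
]
print(allowed_urls(rules, candidates))  # URLs not covered by a Disallow rule
print(rules.crawl_delay("*"))           # seconds to wait between requests
```

To run this against the live file, `set_url(BASE + "/robots.txt")` followed by `read()` would replace the `parse()` call.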
I have always wondered where HN's data portability policy is. When I back up my submissions and comments, I am forced to break the rules.

It would be good to have a way to download ALL your stuff. Ask PG?

I see that submitted and threads are also not allowed.

What is a safe rate at which to crawl this data if I absolutely need it? 30 minutes between users? 1 hour between users?

I have been following the HNSearch API for a long time and found that they crawl user submission URLs and user detail URLs every 2-4 hours or so.
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://news.ycombinator.com/robots.txt")
    rp.read()  # fetches and parses robots.txt
    rp.can_fetch("*", "https://news.ycombinator.com/news")
    # True


cool
But /x? is for the next page.
So apparently you can get two pages of ranking, using / and /news2.
Depends on the intent. If it is user-initiated (say, a mobile-formatted version of the site), it wouldn't have to obey robots.txt, since it is not a crawler, just another web browser.
Well, I am also trying to get user submissions, so maybe I will have to violate robots.txt.
Just use https://www.hnsearch.com, along with https://www.hnsearch.com/rss and https://www.hnsearch.com/bigrss if you want to mimic the front page.

There is rarely a need to scrape HN directly, but if you do, make sure your bot is polite (especially with respect to rate limits).
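
One way to keep a bot polite is to enforce a minimum gap between requests. The following is only a sketch: the `Throttle` class is a made-up helper, not part of any library mentioned here, and the clock and sleep functions are injectable so it can be exercised without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum number of seconds between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep until min_interval has passed since the last call.

        Returns the number of seconds slept (0.0 if no wait was needed).
        """
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay:
            self._sleep(delay)
        self._last = self._clock()
        return delay
```

A crawler would create `Throttle(30)` once (30 seconds being the Crawl-delay quoted from robots.txt earlier in the thread) and call `wait()` before every request.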

I am trying to fetch all posts and comments, plus all user data. I will try HNSearch.
Yahoo Pipes would work really well if you're willing to write a few HTML regexes or DOM element selectors.

http://pipes.yahoo.com/pipes/

Not a full featured api, but a way to scrape all of HN: http://jcla1.com/blog/2013/05/13/crawling-hackernews/

Disclaimer: It's my own blog

edit: Uses HNSearch, so it doesn't violate the robots.txt and can be crawled faster

Did you manage to download the whole database that way? Edit: Also, why didn't you use the "start" (offset) parameter?
No, I haven't tried to download it yet. Regarding your question, if you try to use a start > 999 you get this error: "Validation error: max limit is 100, max start+limit is 1000", which is why I avoided that parameter.
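
To make that limit concrete, here is a small sketch that enumerates the (start, limit) windows allowed by the quoted error message. It only does the arithmetic; the function names are made up, and nothing here calls the HNSearch API.

```python
MAX_LIMIT = 100               # from the error: "max limit is 100"
MAX_START_PLUS_LIMIT = 1000   # from the error: "max start+limit is 1000"


def is_valid_window(start, limit):
    """True if a (start, limit) pair satisfies both quoted constraints."""
    return (
        start >= 0
        and 0 < limit <= MAX_LIMIT
        and start + limit <= MAX_START_PLUS_LIMIT
    )


def page_windows(limit=MAX_LIMIT):
    """Yield consecutive (start, limit) pairs until the offset cap is hit."""
    if not 0 < limit <= MAX_LIMIT:
        raise ValueError("limit must be between 1 and %d" % MAX_LIMIT)
    start = 0
    while start < MAX_START_PLUS_LIMIT:
        # Shrink the last window so start + limit never exceeds the cap.
        size = min(limit, MAX_START_PLUS_LIMIT - start)
        yield start, size
        start += size


windows = list(page_windows())
print(len(windows), windows[0], windows[-1])
```

So a single query can page through at most its first 1000 results, in ten requests of 100; anything beyond that needs the query itself to be narrowed.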
You don't even need an API; all you need is an RSS reader pointed at https://news.ycombinator.com/rss
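
If you would rather read the feed from code than from a reader, a rough sketch with Python's standard library follows. The sample XML is made up in the usual RSS 2.0 shape, and the real feed's fields may differ, so treat the element names (`title`, `link`, `comments`) as assumptions.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the RSS 2.0 shape; not copied from the real feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Hacker News</title>
    <link>https://news.ycombinator.com/</link>
    <item>
      <title>Example story one</title>
      <link>https://example.com/one</link>
      <comments>https://news.ycombinator.com/item?id=1</comments>
    </item>
    <item>
      <title>Example story two</title>
      <link>https://example.com/two</link>
    </item>
  </channel>
</rss>
"""


def parse_items(rss_text):
    """Return one dict per <item>, with empty strings for missing fields."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "comments": item.findtext("comments", default=""),
        })
    return items


for entry in parse_items(SAMPLE_RSS):
    print(entry["title"], entry["link"])
```

For the live feed, the bytes returned by `urllib.request.urlopen` can be passed to `parse_items` unchanged.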
I wrote an alright one in Python for use in my HN app for BlackBerry 10. Not sure how good it is, but check it out here: https://github.com/krruzic/Reader-YC/tree/master/app

I'm not sure what you're trying to do though. I used BeautifulSoup because I couldn't get lxml working on BB10, but if it were switched to lxml it would be much faster.

http://hnapp.com/ -- This is the best site built on scraped HN data. It returns data in JSON / RSS format.
Depending on what you're trying to do with the data, you may find http://diffbot.com/products/automatic/ helpful for getting the clean article text and categorization in JSON format. It can be used as a complement/augmentation to the great suggestions here for getting the links.

Disclosure: Founder of Diffbot here.

I wrote a Python wrapper for the iHackerNews API, if that helps.

https://github.com/dmpayton/python-ihackernews

I saw your GitHub repo. Wonderful work, but the API was not working: I got errors when I tried the link http://api.ihackernews.com/by/kaushikfrnd. Can you confirm it will work if I run it on my own server?
Ah, looks like there's an issue with the iHackerNews API itself, which I don't have a hand in. You'll want to hit up @ronnieroller on Twitter. Sorry I can't be of more help. :/
There's a Twitter feed based on HN - https://twitter.com/newsycombinator

You can use the Twitter API and read from there.

There are hundreds of data sets out there. Why must it always be HN?
Because quality datasets are hard to get. E.g. on reddit you would just get cats and memes.
I have a Scrapy-based crawler project available at http://github.com/mvanveen/hncrawl
Can anyone tell me how to get https://news.ycombinator.com/news through the HNSearch API? I want the API link: http://api.thriftdb.com/api.hnsearch.com/
Out of curiosity, why does HN not release an official API?
I bet it's just a cost/benefit analysis. An API is a way to get more eyeballs by motivating 3rd party developers to integrate and publicise your service. HN does not need that: it has enough traffic as it is, and given the target audience, you would see an instant proliferation of half-assed apps hammering its endpoints. So it would be an additional cost for no real benefit.

The current situation (PG and friends optimise a basic but very accessible website, and a handful of third parties build APIs on top) is much more manageable.

My impression is that pg wants to encourage the hacker spirit by providing a bare bones service which could easily have a 'hacked' api built upon it.
My impression is that HN's link and comment data is too valuable for pg to give away.

Certainly, if I had access to it, I know I could do some pretty useful sociology on HN's audience (= the pool of startup hire material).

I don't believe that HN restricts or discourages the scraping of HN content in any way... Other than the restrictions here: https://news.ycombinator.com/robots.txt

If you have a fabulous idea for how to use the data contained on this site, I'm sure everyone will be impressed and interested to see it.

I had the same question in mind. Even Reddit has its own official API.
other than this
I wrote http://scrape.it and http://scrape.ly to do this.
haha good to see someone link it! I am the author of Scrape.it currently on mashape. I also wrote http://scrape.ly for crawling web pages and extracting data.