Any good api to scrape HN other than this?

	Any good api to scrape HN other than this? (github.com)
	34 points by kaushikfrnd 4565 days ago
	How can I scrape HN other than with https://github.com/karan/HackerNewsAPI? Is there a good premade library in Python?

19 comments

carbocation 4565 days ago

The robots.txt from news.ycombinator.com reads as follows:

    User-Agent: * 
    Disallow: /x?
    Disallow: /vote?
    Disallow: /reply?
    Disallow: /submitted?
    Disallow: /submitlink?
    Disallow: /threads?
    Crawl-delay: 30

So nominally you should feel free to set up a scraper that crawls one non-disallowed resource every 30 seconds.
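
As a rough sketch of checking those rules before fetching, Python's standard `urllib.robotparser` can parse the rules quoted above offline and report both whether a path is allowed and what the crawl delay is. The `ROBOTS_TXT` constant, the `BASE` name, and the sample `/x?fnid=example` query are illustrative assumptions, not something from the site itself.

```python
from urllib import robotparser

# The rules quoted above, pasted in so the sketch runs without network access.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /submitlink?
Disallow: /threads?
Crawl-delay: 30
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

BASE = "https://news.ycombinator.com"
front_page_allowed = rp.can_fetch("*", BASE + "/news")
next_page_allowed = rp.can_fetch("*", BASE + "/x?fnid=example")  # hypothetical query
delay_seconds = rp.crawl_delay("*")  # seconds to wait between requests

print(front_page_allowed, next_page_allowed, delay_seconds)
```

A live crawler would call `rp.set_url(...)` and `rp.read()` instead of pasting the rules in, since the file can change over time.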

wslh 4565 days ago

I have always wondered where HN's data portability policy is. When I back up my submissions and comments, I am forced to break the rules.

It would be good to have a way to download ALL your stuff. Ask PG?

rrpadhy 4565 days ago

I see that submitted and threads are also not allowed.

What is a safe rate at which to crawl this data if I absolutely need it? 30 minutes between users? 1 hour between users?

kaushikfrnd 4565 days ago

I have been following the HNSearch API for a long time and found that they crawl user submission URLs and user detail URLs every 2-4 hours or so.

nashequilibrium 4565 days ago

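    from urllib import robotparser  # Python 3; the module was plain `robotparser` in Python 2

    rp = robotparser.RobotFileParser()
    rp.set_url("https://news.ycombinator.com/robots.txt")  # robots.txt lives at the site root
    rp.read()  # fetch and parse the robots.txt
    rp.can_fetch("*", "https://news.ycombinator.com/news")
    # True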

kaushikfrnd 4565 days ago

cool

t0 4565 days ago

But /x? is for the next page.

pedrocr 4565 days ago

So apparently you can get two pages of ranking, using / and /news2.

randomdata 4565 days ago

Depends on the intent. If it is user-initiated (say, a mobile-formatted version of the site), it wouldn't have to obey the robots.txt, since it is not a crawler, just another web browser.

kaushikfrnd 4565 days ago

Well, I am trying to get user submissions too, so maybe I will have to violate the robots.txt.

napoleond 4565 days ago

Just use https://www.hnsearch.com, along with https://www.hnsearch.com/rss and https://www.hnsearch.com/bigrss if you want to mimic the front page.

There is rarely a need to scrape HN directly, but if you do, make sure your bot is polite (especially with respect to rate limits).
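
One way to keep a bot polite is to enforce a minimum gap between requests. The sketch below is a minimal illustration of that idea, not an established library: the `Throttle` class and its `clock` and `sleep` parameters are hypothetical names, and the 30-second figure is simply the `Crawl-delay` value quoted from robots.txt earlier in the thread.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Assumed usage: call wait() immediately before every request.
throttle = Throttle(min_interval=30)
```

Calling `throttle.wait()` before each fetch returns immediately the first time and sleeps for the remainder of the interval on later calls.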

kaushikfrnd 4565 days ago

I am trying to fetch all posts and comments, plus all user data. I will try HNSearch.

goldenkey 4565 days ago

Yahoo Pipes would work really well if you're willing to write a few HTML regexes or DOM element selectors.

http://pipes.yahoo.com/pipes/

jcla1 4565 days ago

Not a full featured api, but a way to scrape all of HN: http://jcla1.com/blog/2013/05/13/crawling-hackernews/

Disclaimer: It's my own blog

edit: Uses HNSearch, so it doesn't violate the robots.txt and can be crawled faster

zerd 4565 days ago

Did you manage to download the whole database that way? Edit: Also, why didn't you use the "start" (offset) parameter?

jcla1 4563 days ago

No, I haven't tried to download it yet. Regarding your question: if you try to use a start > 999 you get this error: "Validation error: max limit is 100, max start+limit is 1000", which is why I avoided that parameter.
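
To make that constraint concrete, here is a small sketch that enumerates the `(start, limit)` pairs the quoted error message would permit. The parameter names `start` and `limit` come from the error text; the `page_windows` helper is hypothetical, and reaching results past the first 1000 would need some other way of splitting the query, which this sketch does not cover.

```python
def page_windows(max_limit=100, max_total=1000):
    """Yield (start, limit) pairs with limit <= max_limit and start + limit <= max_total."""
    start = 0
    while start < max_total:
        limit = min(max_limit, max_total - start)
        yield start, limit
        start += limit


windows = list(page_windows())
print(windows[0], windows[-1], len(windows))
```

With the defaults taken from the error message, this gives ten windows of 100 results each.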

obayesshelton 4565 days ago

You don't even need an API; all you need is an RSS reader pointed at https://news.ycombinator.com/rss
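
If you would rather handle the feed in code than in a reader, a minimal sketch with Python's standard `xml.etree.ElementTree` might look like the following. The `SAMPLE_RSS` text is made up in the usual RSS 2.0 shape so the example runs offline, and `parse_items` is a hypothetical helper; the real feed's exact fields are an assumption here.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the usual RSS 2.0 shape; a real feed would be downloaded first.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Hacker News</title>
    <item>
      <title>Example story</title>
      <link>https://example.com/story</link>
      <comments>https://news.ycombinator.com/item?id=1</comments>
    </item>
    <item>
      <title>Another story</title>
      <link>https://example.com/another</link>
      <comments>https://news.ycombinator.com/item?id=2</comments>
    </item>
  </channel>
</rss>
"""


def parse_items(rss_text):
    """Return a list of (title, link) tuples from an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]


items = parse_items(SAMPLE_RSS)
print(items)
```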

deft 4565 days ago

I wrote an alright one in Python for use in my HN app for BlackBerry 10. Not sure how good it is, but check it out here: https://github.com/krruzic/Reader-YC/tree/master/app

I'm not sure what you're trying to do, though. I used BeautifulSoup because I couldn't get lxml working on BB10, but if it were switched to lxml it would be much faster.

shamsulbuddy 4565 days ago

http://hnapp.com/ is the best HN scraping site. It returns data in JSON / RSS format.

mikektung 4565 days ago

Depending on what you're trying to do with the data, you may find http://diffbot.com/products/automatic/ helpful for getting the clean article text and categorization in JSON format. It can be used as a complement/augmentation to the great suggestions here for getting the links.

Disclosure: Founder of Diffbot here.

dmpayton 4565 days ago

I wrote a Python wrapper for the iHackerNews API, if that helps.

https://github.com/dmpayton/python-ihackernews

kaushikfrnd 4565 days ago

I saw your GitHub repo. Wonderful work, but the API was not working: I got some errors when I tried the link http://api.ihackernews.com/by/kaushikfrnd. Can you confirm it will work if I run it on my own server?

dmpayton 4564 days ago

Ah, looks like there's an issue with the iHackerNews API itself, which I don't have a hand in. You'll want to hit up @ronnieroller on Twitter. Sorry I can't be of more help. :/

droid_w 4565 days ago

There's a Twitter feed based on HN: https://twitter.com/newsycombinator

You can use the Twitter API and read from there.

amirouche 4565 days ago

There are hundreds of data sets out there. Why must it always be HN?

zerd 4565 days ago

Because quality datasets are hard to get. E.g. on reddit you would just get cats and memes.

mvanveen 4565 days ago

I have a Scrapy-based crawler project available at http://github.com/mvanveen/hncrawl

cheeaun 4565 days ago

I built https://github.com/cheeaun/node-hnapi

kaushikfrnd 4565 days ago

Can anyone tell me how to get https://news.ycombinator.com/news through the HNSearch API? I want the API link -> [http://api.thriftdb.com/api.hnsearch.com/] !!

rotub 4565 days ago

https://www.hnsearch.com/api

jenjenhar 4565 days ago

Out of curiosity, why does HN not release an official API?

toyg 4565 days ago

I bet it's just a cost/benefit analysis. An API is a way to get more eyeballs by motivating 3rd party developers to integrate and publicise your service. HN does not need that: it has enough traffic as it is, and given the target audience, you would see an instant proliferation of half-assed apps hammering its endpoints. So it would be an additional cost for no real benefit.

The current situation (PG and friends optimise a basic but very accessible website, and a handful of third parties build APIs on top) is much more manageable.

code_duck 4565 days ago

My impression is that pg wants to encourage the hacker spirit by providing a bare-bones service which could easily have a 'hacked' API built upon it.

taliesinb 4565 days ago

My impression is that HN's link and comment data is too valuable for pg to give away.

Certainly, if I had access to it, I know I could do some pretty useful sociology on HN's audience (= the pool of startup hire material).

code_duck 4565 days ago

I don't believe that HN restricts or discourages the scraping of HN content in any way... Other than the restrictions here: https://news.ycombinator.com/robots.txt

If you have a fabulous idea for how to use the data contained on this site, I'm sure everyone will be impressed and interested to see it.

kaushikfrnd 4565 days ago

I had the same question in mind. Even Reddit has its own official API.

fakename 4565 days ago

other than this

notastartup 4565 days ago

I wrote http://scrape.it and http://scrape.ly to do this.

culo 4565 days ago

Try these:

- https://www.mashape.com/scrape/scrape-it#!documentation

- https://www.mashape.com/karangoel/hnify#!documentation

notastartup 4565 days ago

Haha, good to see someone link it! I am the author of Scrape.it, currently on Mashape. I also wrote http://scrape.ly for crawling web pages and extracting data.
