by carbocation 4565 days ago
The robots.txt from news.ycombinator.com reads as follows:

    User-Agent: * 
    Disallow: /x?
    Disallow: /vote?
    Disallow: /reply?
    Disallow: /submitted?
    Disallow: /submitlink?
    Disallow: /threads?
    Crawl-delay: 30
So nominally you should feel free to set up a scraper that crawls one non-disallowed resource every 30 seconds.
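A rough sketch of what such a scraper's pacing could look like, using Python 3's standard `urllib.robotparser` (`crawl_delay` needs Python 3.6 or later). The rules are parsed from the text quoted above rather than fetched live. The names `load_rules` and `polite_crawl`, the injected `fetch` callable, and the example paths are all illustrative assumptions, not anything HN provides.

```python
import time
from urllib import robotparser

# The rules quoted in the post, copied verbatim.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /submitlink?
Disallow: /threads?
Crawl-delay: 30
"""

BASE = "https://news.ycombinator.com"


def load_rules(text):
    """Parse robots.txt text without touching the network."""
    rp = robotparser.RobotFileParser()
    rp.parse(text.splitlines())
    return rp


def polite_crawl(rp, paths, fetch, agent="*", sleep=time.sleep):
    """Fetch each allowed path, pausing for the Crawl-delay between requests.

    `fetch` is a hypothetical callable (url -> content) supplied by the caller.
    """
    delay = rp.crawl_delay(agent) or 0
    results = {}
    for path in paths:
        url = BASE + path
        if not rp.can_fetch(agent, url):
            continue  # skip anything robots.txt disallows
        if results:
            sleep(delay)  # wait before every request after the first
        results[path] = fetch(url)
    return results


rules = load_rules(ROBOTS_TXT)
```

Calling `polite_crawl(rules, ["/news", "/threads?id=someuser"], fetch=my_fetch)` with your own `my_fetch` would retrieve only `/news`, since the `/threads?` URL is disallowed.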
5 comments

I have always wondered where HN's data portability policy is. When I back up my submissions and comments, I am forced to break the rules.

It would be good to have a way to download ALL your stuff. Ask PG?

I see that submitted and threads are also not allowed.

What is a safe rate limit for crawling this data if I absolutely need it? 30 minutes between users? 1 hour between users?

I have been following the HNSearch API for a long time and found that they crawl user submission URLs and user detail URLs every 2-4 hours or so.
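
    from urllib import robotparser  # Python 3; on Python 2 use "import robotparser"

    rp = robotparser.RobotFileParser()
    # robots.txt lives at the site root, not under /news
    rp.set_url("https://news.ycombinator.com/robots.txt")
    rp.read()  # fetches and parses robots.txt
    rp.can_fetch("*", "https://news.ycombinator.com/news")
    # True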

cool
But /x? is for the next page.
So apparently you can get two pages of ranking, using / and /news2.
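A quick way to check this, as a sketch: feed only the `/x?` rule to Python 3's `urllib.robotparser` and ask about each page. The `fnid=EXAMPLE` query string is a made-up placeholder for whatever the "next page" link actually carries.

```python
from urllib import robotparser

BASE = "https://news.ycombinator.com"

# Only the rule relevant to pagination, taken from the robots.txt above.
rp = robotparser.RobotFileParser()
rp.parse(["User-Agent: *", "Disallow: /x?"])

front_page = rp.can_fetch("*", BASE + "/")
second_page = rp.can_fetch("*", BASE + "/news2")
# Placeholder "next page" link; the real query string will differ.
next_link = rp.can_fetch("*", BASE + "/x?fnid=EXAMPLE")

print(front_page, second_page, next_link)
```

Under these rules the first two come back allowed and the `/x?` link does not, which matches the two-page limit described above.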
Depends on the intent. If it is user-initiated (say, a mobile-formatted version of the site), it wouldn't have to obey robots.txt, since it is not a crawler, just another web browser.
Well, I am trying to get user submissions as well, so maybe I have to violate robots.txt.