| HN Mirror


by carbocation 4565 days ago

The robots.txt from news.ycombinator.com reads as follows:

    User-Agent: * 
    Disallow: /x?
    Disallow: /vote?
    Disallow: /reply?
    Disallow: /submitted?
    Disallow: /submitlink?
    Disallow: /threads?
    Crawl-delay: 30

So nominally you should feel free to set up a scraper that crawls one non-disallowed resource every 30 seconds.
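As a rough illustration of what that could look like, here is a minimal sketch using Python's standard `urllib.robotparser`. It parses the rules exactly as quoted above, skips disallowed paths, and waits for the `Crawl-delay` between requests. The names `build_parser`, `polite_crawl`, and the `fetch` callback are hypothetical helpers made up for this example, and the rules are hard-coded from this post rather than fetched live, so treat it as a starting point rather than a finished scraper.

```python
import time
import urllib.robotparser

# The rules quoted in the post above, hard-coded for illustration.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /submitlink?
Disallow: /threads?
Crawl-delay: 30
"""

BASE = "https://news.ycombinator.com"


def build_parser(text=ROBOTS_TXT):
    """Parse robots.txt rules from a string (no network access)."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(text.splitlines())
    return rp


def polite_crawl(paths, fetch, sleep=time.sleep, agent="*"):
    """Call fetch(url) for each allowed path, pausing between requests.

    `fetch` is a caller-supplied function (hypothetical here); `sleep`
    is injectable so the loop can be exercised without really waiting.
    """
    rp = build_parser()
    delay = rp.crawl_delay(agent) or 0  # 30 for the rules above
    fetched = []
    for path in paths:
        url = BASE + path
        if not rp.can_fetch(agent, url):
            continue  # disallowed by robots.txt, skip it
        if fetched:
            sleep(delay)  # wait only between consecutive requests
        fetch(url)
        fetched.append(url)
    return fetched
```

Calling `polite_crawl(["/news", "/item?id=1"], fetch=my_download)` with your own `my_download` function would request each page in turn with a 30-second pause in between.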

5 comments

wslh 4565 days ago

I have always wondered where HN's data portability policy is. When I back up my submissions and comments, I am forced to break the rules.

It would be good to have a way to download ALL your stuff. Ask PG?


rrpadhy 4565 days ago

I see that submitted and threads are also not allowed.

What is a safe rate for crawling this data if I absolutely need it? 30 minutes between users? 1 hour between users?


kaushikfrnd 4565 days ago

I have been following the HNSearch API for a long time and found that they crawl user submission URLs and user detail URLs every 2-4 hours or so.


nashequilibrium 4565 days ago

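    import urllib.robotparser  # Python 3 name for the old robotparser module

    rp = urllib.robotparser.RobotFileParser()
    # robots.txt lives at the site root, not under /news
    rp.set_url("https://news.ycombinator.com/robots.txt")
    rp.read()  # fetches and parses the robots.txt
    rp.can_fetch("*", "https://news.ycombinator.com/news")
    # True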


kaushikfrnd 4565 days ago

cool


t0 4565 days ago

But /x? is for the next page.


pedrocr 4565 days ago

So apparently you can get two pages of ranking, using / and /news2.


randomdata 4565 days ago

Depends on the intent. If it is user-initiated (say, a mobile-formatted version of the site), it wouldn't have to obey the robots.txt, since it is not a crawler, just another web browser.


kaushikfrnd 4565 days ago

Well, I am trying to get user submissions too, so maybe I have to violate the robots.txt.
