Hacker News
by lolpanda 848 days ago
Reddit blocks all search engines from crawling its comments, essentially blocking all user-generated content.

See this directive in https://www.reddit.com/robots.txt: `Disallow: /r/*/comments/*/*/*/*`

But I can still search Reddit on Google. How does Google manage to get the data?

2 comments

I think you're misinterpreting the rule.

The relevant robots.txt rules are:

    User-Agent: *
    Disallow: */comment/*
    Disallow: /r/*/comments/*/*/*/*

The URL for a comment section looks like: https://www.reddit.com/r/[subreddit]/comments/[id]/[slug]/. This doesn't match the rules above because there are only 2 parts after "comments", not the 4 parts the rule specifies.
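
To check this by hand, here is a minimal sketch of a wildcard matcher. It assumes the common convention that `*` matches any run of characters, a trailing `$` anchors the end of the path, and a rule otherwise matches as a prefix. The function names and the sample paths (`python`, `abc123`, `some_slug`, `def456`) are made up for illustration and are not Reddit's or Google's actual implementation.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt path rule into a regex (assumed wildcard semantics)."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape literal pieces and join them with '.*' wherever the rule had '*'.
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_disallowed(path: str, rules: list[str]) -> bool:
    """Return True if any rule matches the path, starting from its beginning."""
    return any(rule_to_regex(rule).match(path) is not None for rule in rules)


# The two rules quoted above.
RULES = ["*/comment/*", "/r/*/comments/*/*/*/*"]

# Hypothetical paths: a thread page and a deeper link to a single comment.
thread = "/r/python/comments/abc123/some_slug/"
permalink = "/r/python/comments/abc123/some_slug/def456/"

print(is_disallowed(thread, RULES))     # False
print(is_disallowed(permalink, RULES))  # True
```

Under these assumptions the thread page is not covered by either rule, while a deeper URL with one more path segment is.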
Now it says "Disallow: /".
Does Reddit really block all search engines, or do "all" search engines abide by No Spiders? If it's freely available it's ripe for scraping, no matter what Reddit may say.
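
For what it's worth, the check a well-behaved crawler runs against a blanket `Disallow: /` can be sketched with Python's standard `urllib.robotparser`. The user agent name `ExampleBot` and the sample URL are assumptions for illustration. The check is purely voluntary: nothing in the protocol stops a client that never runs it.

```python
from urllib.robotparser import RobotFileParser

# Feed the parser the rules directly instead of fetching them over the network.
parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /"])

# A compliant crawler asks before fetching; "ExampleBot" is a hypothetical agent.
allowed = parser.can_fetch("ExampleBot", "https://www.reddit.com/r/python/")
print(allowed)  # False
```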