by simonw 1234 days ago
Yeah, the GitHub robots.txt is surprisingly restrictive:

https://github.com/robots.txt

   User-agent: *

   Disallow: /*/pulse
   Disallow: /*/tree/
That "/*/tree" rule means that search engine crawlers are allowed to hit the README file of a repo but effectively NONE of the other files in it.

Which means that if you keep your project documentation on GitHub in a docs/ folder it won't be indexed!

You need to publish it to a separate site via GitHub Pages, or use https://readthedocs.org/

(Side note: I just noticed https://github.com/ekansa/Open-Context-Data is explicitly listed in the robots.txt for GitHub - the only repo that gets a mention like that. I'd love to know the story behind that!)

That repo apparently used to be the largest on GitHub: https://news.ycombinator.com/item?id=5912922. I bet Google was repeatedly scraping the entire thing and putting too much strain on their servers at the time the rule was added. It's been 10 years; what are the odds nobody at GitHub today remembers why it was added?

Also, very relatable to see a decade-old "I'll update this shortly" comment that was never updated. We all have a few of those.

It appears that the creator of the repo actually confirmed this: https://twitter.com/ekansa/status/1137052076062650368
/*/tree is only for directory listings. File contents will be under a /blob/ path, e.g. https://github.com/facebook/react/blob/main/AUTHORS, and should be, AFAIK, indexable.

(mandatory disclaimer: I'm a GitHub employee, not speaking on behalf of the company)
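
To make the tree-vs-blob distinction concrete, here is a rough sketch of how a crawler might evaluate the two Disallow rules quoted upthread. It assumes Google-style wildcard semantics ('*' matches any run of characters, a trailing '$' anchors the end, otherwise rules are prefix matches) and ignores Allow rules and longest-match precedence entirely. The helper names `rule_to_regex` and `is_disallowed` are made up for illustration, and the `/pulse` path is a hypothetical example.

```python
import re
from typing import Sequence


def rule_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate one robots.txt path pattern into a compiled regex.

    Assumption: '*' is a wildcard, a trailing '$' anchors the end of the
    path, and everything else is matched literally from the start.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path: str, disallow_rules: Sequence[str]) -> bool:
    """Return True if any Disallow pattern matches the start of the path."""
    # re.match anchors at the beginning, which gives prefix semantics.
    return any(rule_to_regex(rule).match(path) for rule in disallow_rules)


# The two rules quoted from https://github.com/robots.txt in the parent comment.
RULES = ["/*/pulse", "/*/tree/"]

for path in [
    "/facebook/react",                     # repo landing page (README)
    "/facebook/react/tree/main/packages",  # directory listing
    "/facebook/react/blob/main/AUTHORS",   # file contents
    "/facebook/react/pulse",               # hypothetical activity page
]:
    verdict = "blocked" if is_disallowed(path, RULES) else "allowed"
    print(f"{path} -> {verdict}")
```

Under those assumptions, only the `/tree/` and `/pulse` paths are blocked, while the repo root and the `/blob/` URL pass. Whether a given crawler actually follows these semantics is a separate question.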

I asked about this on the support forum a while ago and never got a satisfactory response: https://github.com/community/community/discussions/20958
If they can't hit `/*/tree` is there a way to know the URLs of the files?
Direct links from crawlable pages
Sure, clone the git repo.
GitHub would not be happy with Google cloning all repos, and many of them at a high frequency, in order to circumvent a robots.txt restriction.
They're clever people, they could just do partial updates (pull instead of clone). I doubt it'd be that much of a strain.
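
For what the "partial updates" idea might look like in practice, here is a minimal sketch: clone once, then only fetch new objects on later visits. It uses `git fetch` rather than `git pull` as the incremental step (my substitution, since a mirror has no working tree to merge into). The function names and the example paths are hypothetical, and nothing here says anything about how GitHub or Google actually behave.

```python
import subprocess
from pathlib import Path
from typing import List


def sync_command(repo_url: str, dest: Path) -> List[str]:
    """Pick the git command for this visit: full clone or incremental fetch."""
    if (dest / ".git").is_dir():
        # Already cloned: only transfer objects added since the last visit.
        return ["git", "-C", str(dest), "fetch", "--quiet", "origin"]
    # First visit: a full clone is unavoidable.
    return ["git", "clone", "--quiet", repo_url, str(dest)]


def sync(repo_url: str, dest: Path) -> None:
    """Run the chosen command; raises CalledProcessError if git fails."""
    subprocess.run(sync_command(repo_url, dest), check=True)


if __name__ == "__main__":
    # Hypothetical local mirror path; prints the command without running it.
    print(sync_command("https://github.com/facebook/react", Path("mirror/react")))
```

This only shows the client side; it does not address the objection above about load on the server or about sidestepping robots.txt.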
There are also two users:

    Disallow: /account-login
    Disallow: /Explodingstuff/
The first for obvious reasons, the second probably because they've uploaded nothing of substance besides a copy of WannaCry.