by simonw 1234 days ago

Yeah, the GitHub robots.txt is surprisingly restrictive:

https://github.com/robots.txt

   User-agent: *

   Disallow: /*/pulse
   Disallow: /*/tree/

That "/*/tree/" rule means that search engine crawlers are allowed to hit the README file of a repo but effectively NONE of the other files in it.

Which means that if you keep your project documentation on GitHub in a docs/ folder it won't be indexed!

You need to publish it to a separate site via GitHub Pages, or use https://readthedocs.org/

(Side note: I just noticed https://github.com/ekansa/Open-Context-Data is explicitly listed in the robots.txt for GitHub - the only repo that gets a mention like that. I'd love to know the story behind that!)

burkaman 1234 days ago

That repo apparently used to be the largest on GitHub: https://news.ycombinator.com/item?id=5912922. I bet Google was repeatedly scraping the entire thing and putting too much strain on GitHub's servers at the time it was added. It's been 10 years; what are the odds nobody at GitHub today remembers why it was added?

Also, very relatable to see a decade-old "I'll update this shortly" comment that was never updated. We all have a few of those.

snowycat 1234 days ago

It appears that the creator of the repo actually confirmed this: https://twitter.com/ekansa/status/1137052076062650368

knute 1234 days ago

/*/tree is only for directory listings. File contents will be under a /blob/ path, e.g. https://github.com/facebook/react/blob/main/AUTHORS, and should be, AFAIK, indexable.
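To illustrate the distinction, here is a minimal Python sketch of how wildcard rules like the two quoted from the robots.txt could be matched against a URL path. It assumes the common wildcard semantics (`*` matches any run of characters, a trailing `$` anchors the end, everything else is a prefix match) and ignores Allow rules and per-crawler groups. The names `rule_to_regex` and `is_disallowed` and the sample paths are made up for this example; this is not what GitHub or any search engine actually runs.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt path rule into a compiled regex.

    Assumed semantics: '*' matches any run of characters, a trailing
    '$' anchors the end of the path, otherwise the rule is a prefix match.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path: str, disallow_rules: list[str]) -> bool:
    """Return True if any Disallow rule matches the start of the path."""
    # re.match anchors at the beginning, which gives prefix matching.
    return any(rule_to_regex(rule).match(path) for rule in disallow_rules)


# The two rules quoted earlier in the thread.
RULES = ["/*/pulse", "/*/tree/"]

# Illustrative paths only: a repo root, a directory listing, a file view.
for path in [
    "/facebook/react",
    "/facebook/react/tree/main/packages",
    "/facebook/react/blob/main/AUTHORS",
]:
    print(path, "->", "disallowed" if is_disallowed(path, RULES) else "allowed")
```

Under those assumptions only the /tree/ path is reported as disallowed; the repo root and the /blob/ path do not match either rule.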

(mandatory disclaimer: I'm a GitHub employee, not speaking on behalf of the company)

simonw 1234 days ago

I asked about this on the support forum a while ago and never got a satisfactory response: https://github.com/community/community/discussions/20958

staplung 1234 days ago

If they can't hit `/*/tree`, is there a way to know the URLs of the files?

pancrufty 1234 days ago

Direct links from crawlable pages

kadoban 1234 days ago

Sure, clone the git repo.

utopcell 1234 days ago

GitHub would not be happy with Google cloning all repos, and many of them at a high frequency, in order to circumvent a robots.txt restriction.

kadoban 1234 days ago

They're clever people, they could just do partial updates (pull instead of clone). I doubt it'd be that much of a strain.

blep_ 1234 days ago

There are also two users listed:

    Disallow: /account-login
    Disallow: /Explodingstuff/

The first for obvious reasons, the second probably because they've uploaded nothing of substance besides a copy of WannaCry.
