| HN Mirror

	by saddd 1239 days ago
	I have the feeling that whatever you're talking about is explicitly not crawlable.

4 comments

simonw 1239 days ago

Yeah, the GitHub robots.txt is surprisingly restrictive:

https://github.com/robots.txt

   User-agent: *

   Disallow: /*/pulse
   Disallow: /*/tree/

That "/*/tree" rule means that search engine crawlers are allowed to hit the README file of a repo but effectively NONE of the other files in it.

Which means that if you keep your project documentation on GitHub in a docs/ folder it won't be indexed!

You need to publish it to a separate site via GitHub Pages, or use https://readthedocs.org/

(Side note: I just noticed https://github.com/ekansa/Open-Context-Data is explicitly listed in the robots.txt for GitHub - the only repo that gets a mention like that. I'd love to know the story behind that!)

burkaman 1239 days ago

That repo apparently used to be the largest on GitHub: https://news.ycombinator.com/item?id=5912922. I bet Google was repeatedly scraping the entire thing and putting too much strain on their servers at the time it was added. It's been 10 years, what are the odds nobody at GitHub today remembers why it was added?

Also, very relatable to see a decade old "I'll update this shortly" comment that was never updated. We all have a few of those.

snowycat 1239 days ago

It appears that the creator of the repo actually confirmed this: https://twitter.com/ekansa/status/1137052076062650368

knute 1239 days ago

/*/tree is only for directory listings. File contents will be under a /blob/ path, e.g. https://github.com/facebook/react/blob/main/AUTHORS, and should be, AFAIK, indexable.

(mandatory disclaimer: I'm a GitHub employee, not speaking on behalf of the company)
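
To make the /tree/ versus /blob/ distinction concrete, here is a minimal sketch of wildcard matching for the two Disallow rules quoted above. It assumes the common convention that `*` matches any run of characters and a trailing `$` anchors the end of the path. The helper names `rule_to_regex` and `is_disallowed` are made up for illustration, and a real crawler may handle edge cases differently.

```python
import re

# The two rules quoted earlier in the thread (not the full robots.txt).
DISALLOW = ["/*/pulse", "/*/tree/"]


def rule_to_regex(pattern):
    """Turn a robots.txt path pattern into a regex matched from the path start."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path, rules=DISALLOW):
    """Return True if any Disallow rule matches the given URL path."""
    return any(rule_to_regex(rule).match(path) for rule in rules)


for path in [
    "/facebook/react",                    # repo landing page with the README
    "/facebook/react/tree/main/packages", # directory listing
    "/facebook/react/blob/main/AUTHORS",  # file contents
]:
    print(path, "->", "disallowed" if is_disallowed(path) else "allowed")
```

Under these assumptions the directory listing is blocked while the landing page and the /blob/ path are not, which matches the reading that file contents stay indexable as long as a crawler can discover their URLs.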

simonw 1239 days ago

I asked about this on the support forum a while ago and never got a satisfactory response: https://github.com/community/community/discussions/20958

staplung 1239 days ago

If they can't hit `/*/tree` is there a way to know the URLs of the files?

pancrufty 1239 days ago

Direct links from crawlable pages
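
As a rough illustration of that idea, the sketch below pulls /blob/ links out of a page a crawler is allowed to fetch, such as a rendered README. The `BlobLinkCollector` class, the `collect_blob_links` helper, and the sample HTML are all hypothetical; this is not how any particular search engine does it.

```python
from html.parser import HTMLParser


class BlobLinkCollector(HTMLParser):
    """Collect href values of <a> tags that point at /blob/ paths."""

    def __init__(self):
        super().__init__()
        self.blob_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and "/blob/" in value:
                self.blob_links.append(value)


def collect_blob_links(html):
    """Return every /blob/ link found in the given HTML string."""
    collector = BlobLinkCollector()
    collector.feed(html)
    return collector.blob_links


# Hypothetical fragment of a crawlable page linking to a file and a directory.
SAMPLE = (
    '<p>See <a href="/facebook/react/blob/main/AUTHORS">AUTHORS</a> and '
    '<a href="/facebook/react/tree/main/packages">packages</a>.</p>'
)
print(collect_blob_links(SAMPLE))
```

Only the file link is kept; the /tree/ link is skipped because the robots.txt rule would block it anyway.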

kadoban 1239 days ago

Sure, clone the git repo.

utopcell 1238 days ago

GitHub would not be happy with Google cloning all repos, and many of them at a high frequency, in order to circumvent a robots.txt restriction.

kadoban 1238 days ago

They're clever people, they could just do partial updates (pull instead of clone). I doubt it'd be that much of a strain.

blep_ 1239 days ago

There are also two users:

    Disallow: /account-login
    Disallow: /Explodingstuff/

The first for obvious reasons, the second probably because they've uploaded nothing of substance besides a copy of WannaCry.

saurik 1239 days ago

A public git repository is definitely crawlable. Google seems to have given up actively going out of their way to index things that are hard to crawl as they got so big and important it was easier to just tell people "thou must do X or we won't index you and you want to be indexed", but increasingly the content I want to find is in weird little silos.

sebosp 1239 days ago

Curious: if I had the list of repos, is there anything that forbids me from running `while read -r url; do git clone "$url" data; ./train data; rm -rf ./data; done`? Besides licensing, I mean, e.g. rate limits or throttling. Similar question: the code search across all repos in the GitHub UI gets throttled pretty fast, so what do people do? (Not suggesting anyone run the while loop for this, not in a hundred(?) years ;))

celdon25 1239 days ago

That doesn’t change anything regarding the actual point of the comment.

astrange 1239 days ago

Your idea for a search competitor is to ignore robots.txt?

VWWHFSfQ 1239 days ago

or an advertising competitor that ignores DNT!

oh wait

celdon25 1238 days ago