(Side note: I just noticed https://github.com/ekansa/Open-Context-Data is explicitly listed in the robots.txt for GitHub - the only repo that gets a mention like that. I'd love to know the story behind that!)
That repo apparently used to be the largest on GitHub: https://news.ycombinator.com/item?id=5912922. I bet Google was repeatedly scraping the entire thing and putting too much strain on their servers at the time it was added. It's been 10 years, what are the odds nobody at GitHub today remembers why it was added?
Also, very relatable to see a decade-old "I'll update this shortly" comment that was never updated. We all have a few of those.
A public git repository is definitely crawlable. Google seems to have given up going out of its way to index things that are hard to crawl: once it got big and important enough, it was easier to just tell people "thou must do X or we won't index you, and you want to be indexed". But increasingly the content I want to find is in weird little silos.
Curious: if I had the list of repos, is there anything that forbids me from running `while read -r url; do git clone "$url" data; ./train data; rm -rf ./data; done`? Licensing aside, I mean rate limits/throttling. Similar question: code search across all repos in the GitHub UI gets throttled pretty fast, so what do people do? (Not that I'd suggest running that while loop for this in a hundred(?) years, tho ;))
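For what it's worth, here is a rough sketch of a more polite version of that loop, with a pause between clones and cleanup that runs even when a step fails. The names `process_repos` and `handle`, and the 5-second default delay, are my own assumptions for illustration; the delay is not a documented GitHub limit, and none of this says anything about what the terms of service or licenses permit.

```python
import shutil
import subprocess
import tempfile
import time
from pathlib import Path


def git_clone(url, dest):
    # Shallow clone: only the latest commit, which keeps the transfer small.
    subprocess.run(
        ["git", "clone", "--depth", "1", "--quiet", url, str(dest)],
        check=True,
    )


def process_repos(urls, handle, delay_seconds=5.0, clone=git_clone, sleep=time.sleep):
    """Clone each URL into a temp dir, call handle(path), then delete the clone.

    delay_seconds is an arbitrary politeness pause (an assumption, not a
    published limit). clone and sleep are injectable so the loop can be
    exercised without touching the network.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # pause between repos, not before the first
        workdir = Path(tempfile.mkdtemp(prefix="repo-"))
        try:
            clone(url, workdir / "data")
            results.append(handle(workdir / "data"))
        finally:
            # Always remove the working copy, even if clone or handle raised.
            shutil.rmtree(workdir, ignore_errors=True)
    return results
```

Here `handle` stands in for the `./train data` step: any callable that takes the path of the cloned repo.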
https://github.com/robots.txt
That "/*/tree" rule means that search engine crawlers are allowed to hit the README file of a repo but effectively NONE of the other files in it. Which means that if you keep your project documentation on GitHub in a docs/ folder it won't be indexed!
You need to publish it to a separate site via GitHub Pages, or use https://readthedocs.org/
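To make the effect of a wildcard rule like "/*/tree" concrete, here is a minimal sketch of the matching logic. Python's built-in `urllib.robotparser` does plain prefix matching, so this hand-rolls the `*` and `$` handling that the major crawlers use. The single `Disallow: /*/tree/` rule and the example paths are assumptions for illustration, not a copy of the live robots.txt.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern ('*' wildcard, trailing '$' anchor)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_allowed(path, rules):
    """Longest matching pattern wins; on a tie Allow wins; no match means allowed."""
    best_len, allowed = -1, True
    for kind, pattern in rules:
        if pattern and pattern_to_regex(pattern).match(path):
            is_allow = kind.lower() == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed


# Hypothetical rule set modelled on the rule discussed above.
RULES = [("Disallow", "/*/tree/")]

print(is_allowed("/someuser/somerepo", RULES))                 # repo root, where the README renders
print(is_allowed("/someuser/somerepo/tree/main/docs", RULES))  # a docs/ folder under /tree/
```

The repo root does not contain "/tree/", so it stays crawlable, while anything browsed under /tree/ matches the Disallow pattern.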