by CDillinger 3164 days ago
Interesting that 'css/style.cs' is number 10 on the size list. Is this just a style sheet with the wrong extension? I always assumed GitHub's language detection was based on syntax rather than just the file extension. Does anyone know if that's the case?
2 comments

I don't know what's displayed on GitHub.com, but the dataset I queried only looks for a '.cs' extension, so there's a chance that some non-C# files got in. The dataset is here: https://bigquery.cloud.google.com/table/fh-bigquery:github_e...

Fortunately, most of the queries I've run are aggregations or look for C# syntax, so there are only a few places where non-C# code could get in. (I already filter out binary files, which I noticed earlier.)
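
For anyone who wants to screen out mislabeled files like 'css/style.cs', here's a rough Python sketch of a content check. To be clear, this is only an assumption about what such a filter could look like, not what the dataset or Linguist actually does; the regexes and the function names (looks_like_csharp, looks_like_css) are made up for illustration.

```python
import re

# Hypothetical heuristic: declarations that commonly appear in C# source
# but rarely in a stylesheet (using directives, namespaces, type declarations).
CSHARP_HINTS = re.compile(
    r"^\s*(using\s+[\w.]+\s*;"
    r"|namespace\s+[\w.]+"
    r"|(public|private|internal|protected)\s+.*\b(class|interface|struct|enum)\b)",
    re.MULTILINE,
)

# Rough shape of a CSS rule: a selector, an opening brace, then `property: value`.
CSS_RULE = re.compile(r"[^{}]+\{\s*[\w-]+\s*:\s*[^{};]+;?")


def looks_like_csharp(text: str) -> bool:
    """Return True if the text contains C#-style declarations."""
    return bool(CSHARP_HINTS.search(text))


def looks_like_css(text: str) -> bool:
    """Return True if the text has CSS-style rules and no C# declarations."""
    return not looks_like_csharp(text) and bool(CSS_RULE.search(text))


if __name__ == "__main__":
    stylesheet = "body {\n  color: red;\n}\n"
    source = "using System;\n\nnamespace Demo\n{\n    public class Foo { }\n}\n"
    print(looks_like_css(stylesheet), looks_like_csharp(source))
```

A check like this will misclassify some files (a C# file with no declarations near the patterns, for example), so it's better suited to flagging suspicious '.cs' files for review than to silently dropping them.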

By the way, GitHub's language detection is powered by their Linguist library [1].

[1] https://github.com/github/linguist

Thanks for the link. I've always wondered how GitHub does language detection.