GitHub's language detection is broken


	GitHub's language detection is broken (github.com)
	79 points by Allan_Smithee 4480 days ago

21 comments

DannoHung 4480 days ago

Why don't they just let project maintainers say, "This project contains x, y, and z" or something? That'd at least let them get a leg up on doing the categorization right and I don't think many people would mind having that capability.

link

mindcrime 4480 days ago

+1000. Github routinely detects the wrong language for my projects, and there is no way to manually override it. My take is this: If you want to auto-detect the language, fine... but let the owner of the repo override your detection when it's wrong.

It's probably also a bug to even have the notion of "a language" for a repo given the burgeoning polyglot programming trend. So many repos these days contain multiple languages, especially when you consider javascript, that I question if it even makes sense to say 'This project is in language X' at all.

Like you say, the best option really would be to let the repo owners / maintainers just specify this stuff. They are, after all, the ones who know.

link

itcmcgrath 4480 days ago

I wish I had more up buttons. Sometimes you can be too smart for your own good, and the good old fashioned way is superior...

Note: I'm not saying they shouldn't have the auto-detection, because it definitely helps if the maintainer doesn't do it, but for those that want to help classify things - let them!

link

ancarda 4480 days ago

Actually, a way to turn off that feature would be nice. It adds very little value at the cost of it taking days to update. It also marks my dotfiles repo as "VimL" which means any auto-resume tool will assume I know VimL, when I don't. Funny thing is it marks my .vimrc as Perl, not VimL.

link

elwell 4480 days ago

I disagree; I think the process should be as streamlined as possible. However, I could see auto-detection balanced with a confidence threshold which, when not met, would ask the user:

"Sorry, I couldn't determine if you had C code in your repo or is that Limbo code?"

link

Allan_Smithee 4480 days ago

You mean like the other source code repository hosts do?

link

pedalpete 4480 days ago

I think the idea of automated language detection is pretty cool, but why doesn't github just give you the option of correcting it, or labelling it with the language you prefer?

For example, I've got JavaScript modules in repositories. For each module, I make a demo version to show what the module does, and that demo includes a bunch of CSS. Apparently, there is more CSS than there is JavaScript, so GitHub labels the module as CSS, but the important part isn't the CSS, the important part is the JavaScript. In order to resolve this, I've had to move the CSS into a different repository, and ignore it in the JavaScript repository. Seems like a long way around, when all I want to do is correct them and say that the module is actually a JavaScript module.
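To make the complaint concrete, here is a rough sketch of how a "most bytes wins" statistic produces that label and how an owner override could sit on top of it. It assumes the primary language is chosen by byte count; `language_bytes`, `primary_language`, and the sample paths are made-up names for illustration, not GitHub's code.

```ruby
# sizes:     { "path" => bytes }       (hypothetical input)
# languages: { "path" => "Language" }  (hypothetical per-file detection result)
def language_bytes(sizes, languages)
  sizes.each_with_object(Hash.new(0)) do |(path, bytes), totals|
    lang = languages[path]
    totals[lang] += bytes if lang  # files with no detected language are skipped
  end
end

# The owner's explicit choice wins; otherwise the language with the most bytes does.
def primary_language(totals, owner_override = nil)
  return owner_override if owner_override
  top = totals.max_by { |_lang, bytes| bytes }
  top && top.first
end

sizes     = { "demo/style.css" => 9000, "module.js" => 4000 }
languages = { "demo/style.css" => "CSS", "module.js" => "JavaScript" }
totals    = language_bytes(sizes, languages)

puts primary_language(totals)                # CSS
puts primary_language(totals, "JavaScript")  # JavaScript
```

With an override like this, the demo CSS could stay in the same repository.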

link

ZoFreX 4480 days ago

I actually much prefer BitBucket's way of doing things, for exactly this reason. It doesn't even try to detect - it just asks me. Sometimes the simplest solutions are the best.

link

michaelmior 4480 days ago

Language detection as discussed in the link is per-file. I don't think overriding individual files makes sense since it's likely to be more trouble than it's worth. But I can understand the desire to change the detected language of the project.

link

Allan_Smithee 4480 days ago

How 'bout a project-specific property list that looks something like this:

.rb=RealBasic .m=Mercury .pl=Prolog .js=SomeCrapOrOther …
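A minimal sketch of reading such a list, assuming whitespace-separated ".ext=Language" pairs. The format, `parse_overrides`, and `override_for` are hypothetical; nothing like this exists in Linguist as far as this thread shows.

```ruby
# Parse ".ext=Language" pairs into a Hash; malformed entries are skipped.
def parse_overrides(text)
  text.split.each_with_object({}) do |pair, map|
    ext, language = pair.split("=", 2)
    next if ext.nil? || language.nil? || language.empty?
    map[ext.downcase] = language
  end
end

# Look up a file's language in the overrides.
# nil means "no override, fall back to auto-detection".
def override_for(path, overrides)
  overrides[File.extname(path).downcase]
end

overrides = parse_overrides(".rb=RealBasic .m=Mercury .pl=Prolog")
puts override_for("src/solver.m", overrides)  # Mercury
```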

link

michaelmior 4480 days ago

Seems like more effort than it's worth still to deal with project-specific settings. AFAIK the only two things this practically affects are syntax highlighting and repository stats. That approach would be a good tradeoff though if things are important enough.

link

013 4480 days ago

This 'lewellyn' person seems to be complaining about the lack of support for the language Limbo, a language for the Inferno OS. Both seem quite outdated and out of use. He also complains about how GitHub is focusing on 'cool' kid languages, which I am guessing refers to modern, popular languages (if this is the definition of cool, then yes, they are). If I were GitHub, I would do the same. It's called priorities. I kind of get the vibe that lewellyn is some kind of 'hipster': his obscure language is better than the 'cool' kids' simply because he's using it. I also wouldn't phrase it as "GitHub's language detection is broken"; it's merely missing a feature/language.

link

choult 4480 days ago

I suspect his - rather labored - point is that there are multiple use cases to show that the design of Linguist's configuration is flawed as a rule and not an exception, and the lack of attention paid to this particular issue is perhaps indicative of a more general Github attitude towards the less trendy languages and technologies out there.

link

scott_s 4480 days ago

Which I think is an uncharitable way of saying "Github prioritizes working on things that will impact the most people."

link

hackcasual 4480 days ago

From my reading of the issue it seems that he's complaining about a limitation of the tool Linguist. There's a suggested fix that doesn't, as far as I can tell, involve changing how Inferno code is written. My understanding is that primary_extension is simply used to short circuit analysis when it's unique to a language. In this case if the primary extension was .inferno and .m was in other extensions it seems that the sample code would be used by the classifier to distinguish between inferno, MATLAB, and obj-c.

To me this comes off as assuming the worst intentions on behalf of the github developers.

link

nox_ 4480 days ago

> My understanding is that primary_extension is simply used to short circuit analysis when it's unique to a language.

No, the primary_extension is only used in a gists_helper.rb file outside the Linguist repos. Note that the feature is deprecated anyway.

https://github.com/github/linguist/blob/master/lib/linguist/...

link

DCKing 4480 days ago

Nobody's arguing that GitHub shouldn't prioritize popular languages. I don't think that the responses to this pull request show 'prioritization' however, they show incompetence and close-mindedness instead.

link

skywhopper 4480 days ago

More likely a lack of time. If you'd like to see it fixed, start contributing to the fix.

link

nox_ 4480 days ago

Do you mean, a fix like https://github.com/github/linguist/pull/985?

link

DCKing 4480 days ago

A lack of time doesn't cause problems with multiple languages that only use the trivial ".m" extension. Bad design does.

Luckily I use languages popular enough to be classified correctly.

link

apetresc 4480 days ago

Yeah, kind of a self-important hipster too.

> Basically, Github needs to be accepting of programmers of all stripes, or they are destined to be irrelevant (or at least doing lots of scrambling) once the trendy kids move on from the trendy things they're doing and the currently-popular languages start falling out of style with a reversion to a previous status quo. Github needs to accept that there is a vast wealth of code out there which predates it and which will easily postdate it.

Okay there, buddy. I don't think lack of Limbo support is going to be GitHub's eventual downfall.

link

jperkin 4480 days ago

Their language detection is indeed terrible. I have a repository (https://github.com/jperkin/pilights) which is entirely composed of shell scripts and a single markdown README. GitHub's analysis?

  Perl 83.5%	  Shell 16.5%

There is not a single .pl or .pm file, nor a single mention of 'perl' anywhere in the repository, and all scripts begin with #!/bin/sh.

A number of my other repositories have similar problems, but this one is by far the worst.

link

LeonidasXIV 4480 days ago

Sorry, I see a huge blue bar saying 100% shell.

link

theOnliest 4480 days ago

Earlier this morning when I looked, it looked like what OP said... it looks like something has changed in the three hours or so since.

link

jperkin 4479 days ago

Huh, yes, they appear to have coincidentally fixed it since I wrote that comment. Maybe I need to start reporting all GitHub bugs as Hacker News comments...

link

bru 4480 days ago

Incriminated comment: https://github.com/github/linguist/pull/748#issuecomment-374...

> if you'd like Mercury language detection on GitHub then with the current implementation of Linguist you need to pick a different (unique as Objective-C already defines this) primary_extension and add .m to the extensions array which will force Linguist into using the other detection methods mentioned above.

link

moron4hire 4480 days ago

what, then, is the point of the primary_extension field?

EDIT: or as I like to yell at Github for Windows when it can't revert out of a merge conflict "WHAT IS EVEN THE POINT OF YOU?!"

link

skywhopper 4480 days ago

It's just a design error in the original implementation where linguist assumes that the "primary_extension" for any particular language will be unique among all primary_extensions. Obviously that was a mistake, but that's where we are. The comment that set people off was perhaps poorly worded, but it was an honest suggestion to work around the design bug.
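To illustrate the kind of assumption involved (this is a toy model, not Linguist's actual code or data; `sample_languages`, `primary_index`, and `extension_index` are invented for the example): if languages are indexed by a single primary extension, a second language claiming ".m" silently replaces the first, whereas an index over all extensions keeps every candidate for a later disambiguation step.

```ruby
# Hypothetical language table for illustration only.
def sample_languages
  [
    { name: "Objective-C", primary_extension: ".m",      extensions: [".m", ".h"] },
    { name: "Mercury",     primary_extension: ".m",      extensions: [".m"] },
    { name: "MATLAB",      primary_extension: ".matlab", extensions: [".matlab", ".m"] }
  ]
end

# Index keyed by primary extension: a later entry silently replaces an earlier one.
def primary_index(languages)
  languages.each_with_object({}) do |lang, idx|
    idx[lang[:primary_extension]] = lang[:name]
  end
end

# Index keyed by every extension: collisions are kept as candidate lists.
def extension_index(languages)
  languages.each_with_object(Hash.new { |h, k| h[k] = [] }) do |lang, idx|
    lang[:extensions].each { |ext| idx[ext] << lang[:name] }
  end
end

p primary_index(sample_languages)[".m"]    # "Mercury" (Objective-C was overwritten)
p extension_index(sample_languages)[".m"]  # ["Objective-C", "Mercury", "MATLAB"]
```

The MATLAB row shows the suggested workaround: an arbitrary unique primary extension, with ".m" listed among the others.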

link

moron4hire 4480 days ago

A better suggestion: fix the defect. Just delete primary_extension. At best, it does nothing that the extensions array can't do, as at first glance every check of primary_extension appears to also include a check of extensions.

At worst, it is confusing to implementers and requires chicanery to work around... which is exactly the case we're in. We're in the worst case scenario for this bit of code, and there is no upside to its best-case scenario. Just delete the code.

link

DangerousPie 4480 days ago

I bet it's not as easy as "just delete the code". They will probably have to do quite a bit of refactoring to remove this, followed by a probably even larger amount of testing.

Long-term this is probably the right solution, but why go through all this trouble right now if there is a simple workaround? It seems like the only problem right now is a few people's pride.

link

moron4hire 4480 days ago

Of course it's not that easy. They don't have a compiler with a static checker to show them all the places the field was used :P

link

Allan_Smithee 4480 days ago

How much you wanna bet? (https://github.com/github/linguist/issues/985)

link

RyanZAG 4480 days ago

Probably if the classifier can't determine the language, it will fall back to the primary_extension field - as it should, if the classifier can't determine if a .m file is objc or mercury, it should and will default to objc.

Classification is never 100% accurate.

EDIT: Exact method that it is used is reported here: https://github.com/github/linguist/pull/748#issuecomment-374...

link

moron4hire 4480 days ago

It's the other way around: the extension check is run before the classifier. And given the limitations of file extensions, an extension collision is likely enough (not the most likely outcome, but likely enough) that it's not worth bothering with primary_extension; just use the collection of extensions as a culling system to minimize the work for classification.

In other words, it seems like the overall design wouldn't be hurt too much by just extricating primary_extension completely. Best-case scenario, primary_extension is equivalent to always having at least one item in extensions. It does nothing else.

Also, this looks like a bug: "if possible_languages.length > 1 ... else possible_languages.first" What if length is 0? That's not greater than 1. I'm not familiar with Ruby, does first return null on an empty array, or does it error? LINQ in .NET has separate First<T> and FirstOrDefault<T> methods: one errors, the other returns default(T) (which is null in the case of reference types). Or is there a default match in the index that occurs when no other language is found? https://github.com/github/linguist/blob/master/lib/linguist/...
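For what it's worth, Ruby's Array#first returns nil on an empty array rather than raising, so the quoted branch would yield nil when nothing matches. A small sketch of that shape (`pick_language` and the `:needs_classifier` placeholder are invented here, not Linguist's code):

```ruby
# Mirrors the quoted "length > 1 ... else first" structure.
def pick_language(possible_languages)
  if possible_languages.length > 1
    :needs_classifier  # placeholder for "run the classifier on the candidates"
  else
    possible_languages.first  # nil when the array is empty, no exception
  end
end

p pick_language([])                          # nil
p pick_language(["Mercury"])                 # "Mercury"
p pick_language(["Objective-C", "Mercury"])  # :needs_classifier
```

Whether nil is handled sensibly further up depends on the caller, which this sketch does not cover.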

Not instilling a lot of confidence that someone really thought through this bit of code. I'm not saying I very strictly think through everything I write, but I also don't write software for thousands of users, and I acknowledge that I've grown rather complacent in terms of time spent per unit code.

link

moron4hire 4480 days ago

I'm looking at more of this, and jesus christ, this is completely wrong: https://github.com/github/linguist/blob/master/lib/linguist/...

Code that commonly resides in .asp files is completely different from code that commonly resides in .aspx files. They are not synonyms for each other. Also, I would wager that C# aspx files are a tad more common than VB.NET aspx files.

It's even worse than lumping together .c and .cpp. You at least have some chance of getting .c files to compile in a C++ compiler. There is no chance of running ASP code through the ASP.NET engine.

This is why "ASPX" as a term exists, to differentiate from ASP.

link

eCa 4480 days ago

I have a few Perl projects on Github that use Bootstrap. Main language (according to Github): Javascript.

I expect that Javascript's github popularity ranking is (a little bit) inflated due to such issues.

link

mindcrime 4480 days ago

I expect it's a lot inflated. I also have repos that are primarily Groovy, but show up as "Javascript" due to the presence of JQuery, Bootstrap, etc.

link

Allan_Smithee 4480 days ago

Unfortunately, until this core issue is fixed, users can't really submit further pull requests to fix the other issues which would correct the "inflation" we all know and hate.

link

natebrennand 4480 days ago

They actually have a set of files and libraries that they ignore (.DS_Store/jquery/bootstrap/etc)
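Roughly, that kind of filter can be pictured as a list of path patterns applied before counting. The patterns and the helper names below (`vendor_patterns`, `vendored?`, `countable`) are illustrative guesses, not the actual list from the repo:

```ruby
# Hypothetical ignore patterns; the real ones live in the Linguist repository.
def vendor_patterns
  [
    /(^|\/)\.DS_Store$/,
    /(^|\/)jquery([^\/]*)\.js$/i,
    /(^|\/)bootstrap([^\/]*)\.(js|css)$/i,
    /(^|\/)node_modules\//
  ]
end

def vendored?(path)
  vendor_patterns.any? { |pattern| path.match?(pattern) }
end

# Only paths that survive the filter count toward language stats.
def countable(paths)
  paths.reject { |path| vendored?(path) }
end

p countable(["lib/app.pl", "public/js/jquery-1.10.2.min.js", "src/main.js"])
# ["lib/app.pl", "src/main.js"]
```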

https://github.com/github/linguist/blob/master/lib/linguist/...

link

cl8ton 4480 days ago

I’ve been in the programming trenches since the early 90’s, fluent in 5 languages at the production level, and have to say I have never heard of the language ‘Limbo’. I don’t fault GitHub one bit.

I suppose I could Google it and act like I know… naw

link

Allan_Smithee 4480 days ago

I don't think the point was to complain about Limbo being missing. I think the point was to show that saying "Objective-C is the only language which can ever use .m as its primary extension" affects more than just the two languages listed earlier in the PR. The PR itself is about Mercury, after all.

Are people even reading the context of the rest of the PR?

link

rkangel 4480 days ago

I'm pretty sure there's no one with an encyclopaedic knowledge of programming languages. The industry is enormous, and just because you haven't heard of it doesn't mean that it's irrelevant. "Niche" is not the same as "irrelevant".

And even if it WAS irrelevant and only important to a very small number of people, that doesn't mean it can be ignored.

link

nahname 4480 days ago

>And even if it WAS irrelevant and only important to a very small number of people, that doesn't mean it can be ignored.

I don't follow. That sounds like the exact criteria for something to be ignored.

link

moron4hire 4480 days ago

This really explains SV's homeless problem.

link

akerl_ 4480 days ago

Given infinite time and developer bandwidth, sure. But we don't live in that world, so "do the work that gives the most benefit to the most users" remains the preferable real-world strategy.

link

nox_ 4480 days ago

Except that primary_extension does not serve any purpose.

link

kalleboo 4480 days ago

From the thread it looks like there are over 6 languages that use .m as the filename extension (including both MATLAB and Mathematica which you may have heard of), meaning the whole concept of a unique "primary_extension" is kind of ludicrous.

link

fastball 4480 days ago

I thought that Mathematica uses .nb extension?

link

cowsandmilk 4480 days ago

Mathematica notebooks use .nb ; but Mathematica scripts generally use .m [1]

[1] https://reference.wolfram.com/mathematica/tutorial/Mathemati...

link

KingMob 4480 days ago

True. But I'll bet you've heard of Matlab, which also uses .m, and is just as old as Obj-C. Matlab is everywhere in scientific computing.

link

Allan_Smithee 4480 days ago

Five languages across two decades?

Stand back, gents! This one is a champion!

link

Aqwis 4480 days ago

I don't really care about Limbo, but GitHub seems to think my .m files are all M(UMPS) files, and not Matlab files, the most obvious choice. Highly annoying.

link

RyanZAG 4480 days ago

I don't get it - why is a bunch of people trolling the github project with fairly irrelevant arguments interesting? Could someone who upvoted this explain the logic?

link

girvo 4480 days ago

How is having a differing opinion "trolling"? Seriously, this word has lost all meaning at this point.

link

akerl_ 4480 days ago

Ignoring the suggested workarounds (setting a unique primary extension and then having the correct extension in the array, for instance) and continuing to rampage in the comments in an attempt to stir up the masses seems like the canonical example of trolling an online community.

link

girvo 4480 days ago

Except he actually has a point. GitHub's default behaviour is broken.

In my years of experience online, trolling was specifically riling someone up by saying things the troll doesn't really believe.

Trolling isn't disagreeing that a workaround is sufficient to ignore an actual issue. But that's just my opinion.

link

Allan_Smithee 4480 days ago

"Troll" is like "terrorist" these days. It has absolutely no semantic content beyond "person I disagree with about something".

link

KingMob 4480 days ago

Seriously? I have an open-source Matlab project from my time in academia that's been misclassified as an Obj-C project in the past. Less popular languages are used all over, especially for more niche industries.

link

jbranchaud 4480 days ago

While it is unfortunate that a pull request on this project has been around for 5 months without much progress, I think the commenter is being a bit dramatic. He is acting as if GitHub is blocking all commits with Limbo code. The language can still be under version control, it just might not have syntax highlighting and its own color in the repo stats bar.

GitHub isn't discriminating against certain programmers. Stay calm and keep coding!

link

bjz_ 4480 days ago

> it just might not have syntax highlighting and its own color in the repo stats bar.

It is discriminating, and harmful to all programmers. We need to be able to easily search for these lesser known languages – they are important cultural works. The commenter points out: "Limbo ... seems to have heavily inspired Go (which is currently extremely fashionable)". We are worse off for not having our history readily accessible.

link

mehwoot 4480 days ago

All they are asking is to arbitrarily specify some other extension as the "primary" extension and have ".m" as another extension. Users will still see the same end result.

link

Allan_Smithee 4480 days ago

Unless they use gist.

link

kalleboo 4480 days ago

I miss Mac OS Classic Filetype/Creator codes... Filename extensions are such an ugly hack.

link

hyperpape 4480 days ago

Great example of worse is better in action.

link

deutronium 4480 days ago

Could they use Bayesian classifiers, trained on a corpus of different languages, primarily concentrating on the symbols used in the language?
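Something along these lines, as a toy sketch: a multinomial naive Bayes over runs of punctuation, with add-one smoothing. The class name `SymbolBayes` and the two tiny training samples are made up for illustration; a real corpus would be far larger, and this says nothing about how Linguist's own classifier works.

```ruby
# Tiny multinomial naive Bayes over symbol tokens.
class SymbolBayes
  def initialize
    @counts = Hash.new { |h, k| h[k] = Hash.new(0) }  # language => token => count
    @docs   = Hash.new(0)                             # language => number of samples
    @vocab  = {}                                      # every token seen in training
  end

  # Keep only runs of punctuation, e.g. ":-", "::", "@", "];".
  def tokens(source)
    source.scan(/[^\w\s]+/)
  end

  def train(language, source)
    @docs[language] += 1
    tokens(source).each do |token|
      @counts[language][token] += 1
      @vocab[token] = true
    end
  end

  # Log-probability of each trained language, with add-one smoothing.
  def scores(source)
    total_docs = @docs.values.sum.to_f
    @docs.keys.each_with_object({}) do |lang, out|
      total = @counts[lang].values.sum + @vocab.size
      out[lang] = Math.log(@docs[lang] / total_docs) +
                  tokens(source).sum { |t| Math.log((@counts[lang][t] + 1.0) / total) }
    end
  end

  # Most probable language, or nil if nothing has been trained.
  def classify(source)
    scores(source).max_by { |_lang, score| score }&.first
  end
end

bayes = SymbolBayes.new
bayes.train("Objective-C", "@interface Foo : NSObject\n- (void)bar;\n@end\n[obj message];")
bayes.train("Mercury", ":- module hello.\n:- interface.\n:- pred main(io::di, io::uo) is det.")
puts bayes.classify(":- pred foo(int::in) is det.")  # Mercury
```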

link

dclowd9901 4480 days ago

I think if I was writing a language detector, it would have these features:

- learning heuristics based on user suggestion.

- extension filtering to differentiate similar languages.

- the algo would use prominence and placement of white space and non-word characters to create the DNA of a language. If the language scores below a threshold against the DNA, it doesn't presume, it asks the user. If a language scores high against this DNA, it still allows user override. Whenever a user submits their indicator, its file source would be used to train the heuristic.
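A rough sketch of the threshold-and-override part of that list, using non-word character frequencies as the "DNA" and cosine similarity as the score. The function names, the 0.8 threshold, and the sample profiles are all assumptions for illustration; whitespace placement and the retraining step are left out.

```ruby
# Relative frequency of each non-word, non-space character in a sample.
def profile(source)
  chars  = source.scan(/[^\w\s]/)
  counts = chars.each_with_object(Hash.new(0)) { |c, h| h[c] += 1 }
  counts.transform_values { |n| n.to_f / chars.size }
end

# Cosine similarity between two frequency profiles (0.0 when either is empty).
def cosine(a, b)
  dot  = a.sum { |key, value| value * b.fetch(key, 0.0) }
  norm = Math.sqrt(a.values.sum { |v| v * v }) * Math.sqrt(b.values.sum { |v| v * v })
  norm.zero? ? 0.0 : dot / norm
end

# Returns [language, score], or [:ask_user, score] when confidence is too low.
# A user-supplied override always wins.
def detect(source, profiles, threshold: 0.8, user_override: nil)
  return [user_override, 1.0] if user_override
  sample = profile(source)
  best, score = profiles.map { |lang, prof| [lang, cosine(sample, prof)] }
                        .max_by { |_lang, s| s }
  return [:ask_user, score.to_f] if best.nil? || score < threshold
  [best, score]
end

profiles = {
  "Mercury"     => profile(":- module hello.\n:- pred main(io::di, io::uo) is det."),
  "Objective-C" => profile("@interface Foo : NSObject\n- (void)bar;\n@end\n[obj message];")
}
p detect(":- pred foo(int::in) is det.", profiles).first  # "Mercury"
p detect("x = 1 + 2", profiles).first                     # :ask_user
```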

link

Allan_Smithee 4480 days ago

This is because you likely think before you code.

link

awalton 4480 days ago

> My esoteric programming language isn't properly supported by the popular kids' web tool that I'm likely not even paying to use in the first place. I'm OUTRAGED!

Yep, seems about right.

link

hk__2 4480 days ago

Also, this is untrue. Omgrofl is supported on GitHub, even if nobody uses it.

link

johnduhart 4480 days ago

And posting it to HN of all places is hilarious.

link

joeblau 4480 days ago

What's interesting about this PR is that this case was actually one of the reasons that I created http://www.gitignore.io. GitHub's original repo for .gitignore templates had nearly 1000 open PRs until around Oct 2013 so I built my own repo that would actually accept PRs. Since then, a few employees have worked on accepting PRs, but I had a similar feeling of frustration. Unfortunately, the OP can't just fork this repo because its features are integral to how GitHub works, whereas I was able to hack around the system and create a separate product.

link

skywhopper 4480 days ago

The rant linked to appears to misunderstand the problem and the workaround. @arfon admits there's what amounts to a design bug in Linguist, and so to identify ".m" files, you have to identify a different extension as the "primary" and put the real extension into the "alternate" list. That's a hacky workaround, but it would make the pull request work.

The alternative is to fix the design issue. But that's going to be a lot harder and require more than a few days.

link

nbouscal 4480 days ago

@arfon doesn't admit that there's a bug, rather he says that "requiring a unique primary_extension isn't really a 'bug', rather it's a consequence of how language detection works in Linguist."

The work to fix the design issue was already done by @nox, who submitted a pull request which is still open: https://github.com/github/linguist/pull/985

link

hyperpape 4480 days ago

I guess the charitable reading is "this isn't a bug, it's more of a bad design choice, and we can't just fix it overnight".

But I honestly can't tell if that's what he meant, or if it was more of a "not my problem" type of response.

link

Allan_Smithee 4480 days ago

Except that it was totally fixed overnight. No, wait. Not overnight. Over two hours.

Of course that PR isn't being accepted either.

link

eXpl0it3r 4480 days ago

I can agree that the detection is broken. C++ gets often recognized as C. PHP with some CSS file gets recognized as mostly CSS, etc.

Personally I'd like to have a fixed language that I can set and that the search will use. Next to that, it would be fine for me to statically show what the repository contains, but please use a better language detection, just going by extensions is quite naive.

link

nox_ 4480 days ago

> C++ gets often recognized as C.

The disambiguation test for C++ headers is ridiculous:

      matches << Language["C++"] if data.include?("#include <cstdint>")
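For contrast, a slightly broader check might look like the sketch below. It is purely illustrative: `cpp_header?` and its patterns are my own guesses at constructs that rarely show up in plain C headers, and it would still misfire on some files.

```ruby
# Hypothetical heuristic: does this header look like C++ rather than C?
def cpp_header?(data)
  [
    /^\s*#\s*include\s*<(cstdint|cstddef|string|vector|map|memory|iostream)>/,
    /^\s*template\s*</,
    /^\s*namespace\s+\w+/,
    /^\s*(class|struct)\s+\w+\s*(final\s*)?:\s*(public|private|protected)\b/,
    /\bstd::\w+/
  ].any? { |pattern| data.match?(pattern) }
end

p cpp_header?("#include <vector>\nnamespace foo {\n}\n")            # true
p cpp_header?("#include <stdint.h>\nstruct point { int x; };\n")   # false
```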

link

Allan_Smithee 4480 days ago

Well, I expect that's why so much C++ is misrecognized. Not enough people write valid C++, in Github's narrow world view. :)

link

mcovey 4479 days ago

I wish I could pick the language so I could upload shell scripts without extensions, but it doesn't even read the shebang line.
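Reading the shebang is not much code. A minimal sketch, with a made-up interpreter table and function names (`interpreter_languages`, `language_from_shebang`), handling both "#!/bin/sh" and "#!/usr/bin/env python3" styles:

```ruby
# Hypothetical interpreter table for illustration.
def interpreter_languages
  { "sh" => "Shell", "bash" => "Shell", "zsh" => "Shell",
    "perl" => "Perl", "python" => "Python", "ruby" => "Ruby" }
end

# Returns the language named by the shebang line, or nil if there is none.
def language_from_shebang(source)
  first_line = source.lines.first.to_s
  return nil unless first_line.start_with?("#!")
  parts = first_line[2..-1].strip.split
  return nil if parts.empty?
  # "#!/usr/bin/env python" names the interpreter in the second word.
  name = File.basename(parts[0])
  name = parts[1].to_s if name == "env"
  # Strip version suffixes such as "python3" before the lookup.
  interpreter_languages[name[/\A[a-z]+/]]
end

puts language_from_shebang("#!/bin/sh\necho hi\n")  # Shell
```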

link

johnduhart 4480 days ago

Sorry, but was there actually something useful in that comment? I couldn't tell over the 6 paragraphs of childish moaning.

link

moron4hire 4480 days ago

Okay, but the automatically updating comments view is pretty cool. I didn't know Github did that. That is pretty awesome.

link

Allan_Smithee 4480 days ago

And that would be part of the problem with Github. Emphasis on "pretty cool" visual flair while letting fundamental architecture fly out that is flatly, and very obviously, just plain broken.

link

akerl_ 4480 days ago

Considering that the comment you're replying to said "that feature is pretty cool", and didn't even need to address the actual linked rant, it seems that not everyone agrees with your "this is just plain broken" viewpoint.

I use Github for the visual flair and cool features. If I wanted to run my own fundamental architecture, I'd be doing that.

link

Allan_Smithee 4480 days ago

"I use Github for the visual flair and cool features."

The software crisis spelled out in a single sentence.

link

moron4hire 4480 days ago

I've since looked at some of the Linguist code, and it's kind of shit.

link