Why don't they just let project maintainers say, "This project contains x, y, and z" or something? That'd at least let them get a leg up on doing the categorization right and I don't think many people would mind having that capability.
+1000. Github routinely detects the wrong language for my projects, and there is no way to manually override it. My take is this: If you want to auto-detect the language, fine... but let the owner of the repo override your detection when it's wrong.
It's probably also a bug to even have the notion of "a language" for a repo given the burgeoning polyglot programming trend. So many repos these days contain multiple languages, especially when you consider javascript, that I question if it even makes sense to say 'This project is in language X' at all.
Like you say, the best option really would be to let the repo owners / maintainers just specify this stuff. They are, after all, the ones who know.
I wish I had more up buttons. Sometimes you can be too smart for your own good, and the good old-fashioned way is superior...
Note: I'm not saying they shouldn't have the auto-detection, because it definitely helps if the maintainer doesn't do it, but for those that want to help classify things - let them!
Actually, a way to turn off that feature would be nice. It adds very little value at the cost of it taking days to update. It also marks my dotfiles repo as "VimL" which means any auto-resume tool will assume I know VimL, when I don't. Funny thing is it marks my .vimrc as Perl, not VimL.
I disagree; I think the process should be as streamlined as possible. However, I could see auto-detection balanced with a confidence threshold which, when not met, would ask the user:
"Sorry, I couldn't determine if you had C code in your repo or is that Limbo code?"
I think the idea of automated language detection is pretty cool, but why doesn't github just give you the option of correcting it, or labelling it with the language you prefer?
For example, I've got JavaScript modules in repositories. For each module, I make a demo version to show what the module does, and that demo includes a bunch of CSS. Apparently, there is more CSS than there is JavaScript, so GitHub labels the module as CSS, but the important part isn't the CSS, it's the JavaScript. To resolve this, I've had to move the CSS into a different repository and ignore it in the JavaScript repository. Seems like a long way around, when all I want to do is correct them and say that the module is actually a JavaScript module.
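To make that complaint concrete, here is a minimal sketch, assuming the repository label is simply whichever language accounts for the most bytes (which is what the CSS-over-JavaScript result above suggests). The `override` parameter is hypothetical: it stands in for the owner-supplied setting people in this thread are asking for, and none of these names come from GitHub or Linguist.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


def primary_language(
    file_sizes: Dict[str, Tuple[str, int]],
    override: Optional[str] = None,
) -> Optional[str]:
    """Return the repo label.

    file_sizes maps path -> (detected language, size in bytes).
    An owner-supplied override (hypothetical) always wins; otherwise
    the language with the largest byte total is chosen.
    """
    if override is not None:
        return override
    totals: Counter = Counter()
    for language, size in file_sizes.values():
        totals[language] += size
    if not totals:
        return None  # empty repo: nothing to label
    return totals.most_common(1)[0][0]


# Illustrative repo: a small JS module with a CSS-heavy demo.
repo = {
    "src/module.js": ("JavaScript", 4_000),
    "demo/demo.css": ("CSS", 9_000),
    "demo/index.html": ("HTML", 1_000),
}
```

With these made-up sizes, `primary_language(repo)` picks CSS, while `primary_language(repo, override="JavaScript")` returns what the owner actually meant, without moving any files around.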
I actually much prefer BitBucket's way of doing things, for exactly this reason. It doesn't even try to detect - it just asks me. Sometimes the simplest solutions are the best.
Language detection as discussed in the link is per-file. I don't think overriding individual files makes sense since it's likely to be more trouble than it's worth. But I can understand the desire to change the detected language of the project.
Seems like more effort than it's worth still to deal with project-specific settings. AFAIK the only two things this practically affects are syntax highlighting and repository stats. That approach would be a good tradeoff though if things are important enough.
This 'lewellyn' person seems to be complaining about the lack of support for Limbo, a language for the Inferno OS. Both seem quite outdated and out of use. He also complains about how GitHub is focusing on 'cool kid' languages, which I am guessing refers to modern, popular languages (if this is the definition of cool, then yes, they are). If I were GitHub, I would do the same. It's called priorities. I kind of get the vibe that lewellyn is some kind of 'hipster': his obscure language is better than the 'cool kids' languages simply because he's using it. I also wouldn't phrase it as "GitHub's language detection is broken"; it's merely missing a feature/language.
I suspect his - rather labored - point is that there are multiple use cases to show that the design of Linguist's configuration is flawed as a rule and not an exception, and the lack of attention paid to this particular issue is perhaps indicative of a more general GitHub attitude towards the less trendy languages and technologies out there.
From my reading of the issue it seems that he's complaining about a limitation of the tool Linguist. There's a suggested fix that doesn't, as far as I can tell, involve changing how Inferno code is written. My understanding is that primary_extension is simply used to short circuit analysis when it's unique to a language. In this case if the primary extension was .inferno and .m was in other extensions it seems that the sample code would be used by the classifier to distinguish between inferno, MATLAB, and obj-c.
To me this comes off as assuming the worst intentions on behalf of the github developers.
Nobody's arguing that GitHub shouldn't prioritize popular languages. I don't think that the responses to this pull request show 'prioritization' however, they show incompetence and close-mindedness instead.
> Basically, Github needs to be accepting of programmers of all stripes, or they are destined to be irrelevant (or at least doing lots of scrambling) once the trendy kids move on from the trendy things they're doing and the currently-popular languages start falling out of style with a reversion to a previous status quo. Github needs to accept that there is a vast wealth of code out there which predates it and which will easily postdate it.
Okay there, buddy. I don't think lack of Limbo support is going to be GitHub's eventual downfall.
Their language detection is indeed terrible. I have a repository (https://github.com/jperkin/pilights) which is entirely composed of shell scripts and a single markdown README. GitHub's analysis?
Perl 83.5% Shell 16.5%
There is not a single .pl or .pm file, nor a single mention of 'perl' anywhere in the repository, and all scripts begin with #!/bin/sh.
A number of my other repositories have similar problems, but this one is by far the worst.
Huh, yes, they appear to have coincidentally fixed it since I wrote that comment. Maybe I need to start reporting all GitHub bugs as Hacker News comments...
> if you'd like Mercury language detection on GitHub then with the current implementation of Linguist you need to pick a different (unique as Objective-C already defines this) primary_extension and add .m to the extensions array which will force Linguist into using the other detection methods mentioned above.
It's just a design error in the original implementation where linguist assumes that the "primary_extension" for any particular language will be unique among all primary_extensions. Obviously that was a mistake, but that's where we are. The comment that set people off was perhaps poorly worded, but it was an honest suggestion to work around the design bug.
A better suggestion: fix the defect. Just delete primary_extension. At best, it does nothing that the extensions array can't do: at first glance, every check of primary_extension appears to also include a check of extensions.
At worst, it is confusing to implementers and requires chicanery to work around... which is exactly the case we're in. We're in the worst case scenario for this bit of code, and there is no upside to its best-case scenario. Just delete the code.
I bet it's not as easy as "just delete the code". They will probably have to do quite a bit of refactoring to remove this, followed by a probably even larger amount of testing.
Long-term this is probably the right solution, but why go through all this trouble right now if there is a simple workaround? It seems like the only problem right now is a few people's pride.
Probably if the classifier can't determine the language, it will fall back to the primary_extension field - as it should, if the classifier can't determine if a .m file is objc or mercury, it should and will default to objc.
It's the other way around: the extension check is run before the classifier. And given the limitations of file extensions, an extension collision is likely enough (not the most likely outcome, but likely enough) that it's not worth bothering with primary_extension; just use the collection of extensions as a culling system to minimize the work for classification.
In other words, it seems like the overall design wouldn't be hurt too much by just extricating primary_extension completely. Best-case scenario, primary_extension is equivalent to always having at least one item in extensions. It does nothing else.
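A minimal sketch of that idea, assuming every extension lives in one flat list per language and the extension lookup only narrows the field before any classifier runs. The language table and function names below are illustrative assumptions, not Linguist's actual data or API.

```python
import os
from collections import defaultdict
from typing import Dict, List

# Hypothetical language table: no special "primary" entry, just a list
# of extensions per language. Collisions on ".m" are expected.
LANGUAGES: Dict[str, List[str]] = {
    "Objective-C": [".m", ".h"],
    "MATLAB": [".m"],
    "Mercury": [".m"],
    "Limbo": [".b", ".m"],
    "Ruby": [".rb"],
}


def build_extension_index(languages: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert the table: extension -> every language that claims it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for name, extensions in languages.items():
        for ext in extensions:
            index[ext].append(name)
    return index


def candidates_for(filename: str, index: Dict[str, List[str]]) -> List[str]:
    """Cull by extension only; may return zero, one, or many languages."""
    ext = os.path.splitext(filename)[1]
    return list(index.get(ext, []))


INDEX = build_extension_index(LANGUAGES)
```

Here `candidates_for("foo.m", INDEX)` yields four candidates for a classifier to choose between, and `candidates_for("app.rb", INDEX)` yields exactly one, so no uniqueness rule is needed to get the short circuit.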
Also, this looks like a bug: "if possible_languages.length > 1 ... else possible_languages.first" What if length is 0? That's not greater than 1. I'm not familiar with Ruby, does first return null on an empty array, or does it error? LINQ in .NET has separate First<T> and FirstOrDefault<T> methods: one errors, the other returns default(T) (which is null in the case of reference types). Or is there a default match in the index that occurs when no other language is found?
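For what it's worth, Ruby's Array#first returns nil on an empty array rather than raising. Whether nil is the intended result there is a separate question. A sketch of making all three cases explicit, written in Python rather than Ruby; `pick_language` and `classify` are hypothetical names and not Linguist's code:

```python
from typing import Callable, List, Optional


def pick_language(
    candidates: List[str],
    classify: Callable[[List[str]], str],
) -> Optional[str]:
    """Choose a language from extension-matched candidates.

    classify is a stand-in for whatever content-based classifier
    breaks ties; it is only called when there is a real tie.
    """
    if not candidates:
        return None  # nothing matched: say "unknown" on purpose
    if len(candidates) == 1:
        return candidates[0]  # unique match: no classifier needed
    return classify(candidates)
```

Spelling out the empty case means "unknown" is a deliberate result instead of a side effect of how the else branch happens to behave.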
https://github.com/github/linguist/blob/master/lib/linguist/...
Not instilling a lot of confidence that someone really thought through this bit of code. I'm not saying I very strictly think through everything I write, but I also don't write software for thousands of users, and I acknowledge that I've grown rather complacent in terms of time spent per unit code.
Code that commonly resides in .asp files is completely different from code that commonly resides in .aspx files. They are not synonyms for each other. Also, I would wager that C# aspx files are a tad more common than VB.NET aspx files.
It's even worse than lumping together .c and .cpp. You at least have some chance of getting .c files to compile in a C++ compiler. There is no chance of running ASP code through the ASP.NET engine.
This is why "ASPX" as a term exists, to differentiate from ASP.
Unfortunately, until this core issue is fixed, users can't really submit further pull requests to fix the other issues which would correct the "inflation" we all know and hate.
I’ve been in the programming trenches since the early ’90s, fluent in 5 languages at the production level, and have to say I have never heard of the language ‘Limbo’. I don’t fault GitHub one bit.
I suppose I could Google it and act like I know… naw
I don't think the point was to complain about Limbo being missing. I think the point was to show that saying "Objective-C is the only language which can ever use .m as its primary extension" affects more than just the two languages listed earlier in the PR. The PR itself is about Mercury, after all.
Are people even reading the context of the rest of the PR?
I'm pretty sure there's no one with an encyclopaedic knowledge of programming languages. The industry is enormous, and just because you haven't heard of it doesn't mean that it's irrelevant. "Niche" is not the same as "irrelevant".
And even if it WAS irrelevant and only important to a very small number of people, that doesn't mean it can be ignored.
Given infinite time and developer bandwidth, sure. But we don't live in that world, so "do the work that gives the most benefit to the most users" remains the preferable real-world strategy.
From the thread it looks like there are over 6 languages that use .m as the filename extension (including both MATLAB and Mathematica which you may have heard of), meaning the whole concept of a unique "primary_extension" is kind of ludicrous.
I don't really care about Limbo, but GitHub seems to think my .m files are all M(UMPS) files, and not Matlab files, the most obvious choice. Highly annoying.
I don't get it - why is a bunch of people trolling the github project with fairly irrelevant arguments interesting? Could someone who upvoted this explain the logic?
Ignoring the suggested workarounds (setting a unique primary extension and then having the correct extension in the array, for instance) and continuing to rampage in the comments in an attempt to stir up the masses seems like the canonical example of trolling an online community.
Seriously? I have an open-source Matlab project from my time in academia that's been misclassified as an Obj-C project in the past. Less popular languages are used all over, especially for more niche industries.
While it is unfortunate that a pull request on this project has been around for 5 months without much progress, I think the commenter is being a bit dramatic. He is acting as if GitHub is blocking all commits with Limbo code. The language can still be under version control, it just might not have syntax highlighting and its own color in the repo stats bar.
GitHub isn't discriminating against certain programmers. Stay calm and keep coding!
> it just might not have syntax highlighting and its own color in the repo stats bar.
It is discriminating, and harmful to all programmers. We need to be able to easily search for these lesser known languages – they are important cultural works. The commenter points out: "Limbo ... seems to have heavily inspired Go (which is currently extremely fashionable)". We are worse off for not having our history readily accessible.
All they are asking is to arbitrarily specify some other extension as the "primary" extension and have ".m" as another extension. Users will still see the same end result.
I think if I was writing a language detector, it would have these features:
- learning heuristics based on user suggestion.
- extension filtering to differentiate similar languages.
- the algo would use prominence and placement of white space and non-word characters to create the DNA of a language. If the language scores below a threshold against the DNA, it doesn't presume, it asks the user. If a language scores high against this DNA, it still allows user override. Whenever a user would submit their indicator, its file source would be used to train the heuristic.
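A rough sketch of the list above, under heavy assumptions: the "DNA" is taken to be a frequency count of whitespace and non-word characters, scoring is cosine similarity (my choice, not something the list specifies), and returning `None` stands for "ask the user". Placement of characters is ignored and the extension-filtering bullet is left out; all names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Optional


def profile(source: str) -> Counter:
    """Count every character that is not a letter, digit, or underscore."""
    return Counter(ch for ch in source if not (ch.isalnum() or ch == "_"))


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two character-frequency profiles."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class Detector:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self.profiles: Dict[str, Counter] = {}

    def train(self, language: str, source: str) -> None:
        """Fold a labelled sample into that language's profile."""
        self.profiles.setdefault(language, Counter()).update(profile(source))

    def detect(self, source: str, user_choice: Optional[str] = None) -> Optional[str]:
        """Return a language, or None meaning 'ask the user'."""
        if user_choice is not None:
            # User override always wins and becomes training data.
            self.train(user_choice, source)
            return user_choice
        sample = profile(source)
        best, best_score = None, 0.0
        for language, known in self.profiles.items():
            score = similarity(sample, known)
            if score > best_score:
                best, best_score = language, score
        return best if best_score >= self.threshold else None
```

A caller would treat a `None` result as the cue to prompt the owner, then pass the answer back in as `user_choice` so the same file trains the heuristic.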
> My esoteric programming language isn't properly supported by the popular kids' web tool that I'm likely not even paying to use in the first place. I'm OUTRAGED!
What's interesting about this PR is that this case was actually one of the reasons that I created http://www.gitignore.io. GitHub's original repo for .gitignore templates had nearly 1000 open PRs until around Oct 2013, so I built my own repo that would actually accept PRs. Since then, a few employees have worked on accepting PRs, but I had a similar feeling of frustration. Unfortunately, the OP can't just fork this repo because its features are integral to how GitHub works, whereas I was able to hack around the system and create a separate product.
The rant linked to appears to misunderstand the problem and the workaround. @arfon admits there's what amounts to a design bug in Linguist, and so to identify ".m" files, you have to identify a different extension as the "primary" and put the real extension into the "alternate" list. That's a hacky workaround, but it would make the pull request work.
The alternative is to fix the design issue. But that's going to be a lot harder and require more than a few days.
@arfon doesn't admit that there's a bug, rather he says that "requiring a unique primary_extension isn't really a 'bug', rather it's a consequence of how language detection works in Linguist."
I can agree that the detection is broken. C++ often gets recognized as C. PHP with some CSS files gets recognized as mostly CSS, etc.
Personally I'd like to have a fixed language that I can set and that the search will use. Next to that, it would be fine for me to statically show what the repository contains, but please use a better language detection, just going by extensions is quite naive.
And that would be part of the problem with Github. Emphasis on "pretty cool" visual flair while letting fundamental architecture fly out that is flatly, and very obviously, just plain broken.
Considering that the comment you're replying to said "that feature is pretty cool", and didn't even need to address the actual linked rant, it seems that not everyone agrees with your "this is just plain broken" viewpoint.
I use Github for the visual flair and cool features. If I wanted to run my own fundamental architecture, I'd be doing that.