Hacker News
by RyanZAG 4480 days ago
If the classifier can't determine the language, it will probably fall back to the primary_extension field, as it should: if the classifier can't tell whether a .m file is objc or mercury, it should, and will, default to objc.

Classification is never 100% accurate.

EDIT: The exact method used is described here: https://github.com/github/linguist/pull/748#issuecomment-374...

1 comment

It's the other way around: the extension check is run before the classifier. And given the limitations of file extensions, extension collisions are common enough (not the most likely case, but likely enough) that it isn't worth bothering with primary_extension; just use the collection of extensions as a culling system to minimize the work for classification.

In other words, it seems like the overall design wouldn't be hurt much by removing primary_extension completely. In the best case, primary_extension is equivalent to guaranteeing that extensions always has at least one item. It does nothing else.
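To make the "extensions as a culling system" idea concrete, here's a rough Ruby sketch. Everything in it is made up for illustration (the EXTENSIONS table, candidates_for, detect, and the toy classifier); it is not Linguist's actual code or data, just the shape of the design I'm describing.

```ruby
# Hypothetical extension table; NOT Linguist's real data.
EXTENSIONS = {
  "Objective-C" => [".m", ".h"],
  "Mercury"     => [".m"],
  "Ruby"        => [".rb"]
}.freeze

# Step 1: cheap culling. Keep every language that claims this extension.
def candidates_for(filename)
  ext = File.extname(filename)
  EXTENSIONS.select { |_lang, exts| exts.include?(ext) }.keys
end

# Step 2: only pay for classification when the extension is ambiguous.
# `classifier` is any callable taking (source, candidates) and returning
# one of the candidates.
def detect(filename, source, classifier)
  candidates = candidates_for(filename)
  case candidates.length
  when 0 then nil                      # unknown extension: no guess
  when 1 then candidates.first         # unambiguous: skip the classifier
  else classifier.call(source, candidates)
  end
end

# Toy stand-in for a real classifier (assumption: a single keyword check).
toy_classifier = lambda do |source, candidates|
  guess = source.include?(":- module") ? "Mercury" : "Objective-C"
  candidates.include?(guess) ? guess : candidates.first
end

p detect("foo.m", ":- module foo.", toy_classifier)
p detect("app.rb", "puts 1", toy_classifier)
p detect("data.zzz", "", toy_classifier)
```

Note that no language needs a "primary" extension here: the set of extensions narrows the field, and the classifier only breaks ties.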

Also, this looks like a bug: "if possible_languages.length > 1 ... else possible_languages.first". What if length is 0? That's not greater than 1. I'm not familiar with Ruby: does first return null on an empty array, or does it error? LINQ in .NET has separate First<T> and FirstOrDefault<T> methods: one throws, the other returns default(T) (which is null in the case of reference types). Or is there a default match in the index that applies when no other language is found? https://github.com/github/linguist/blob/master/lib/linguist/...
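For reference, Ruby's Array#first behaves like FirstOrDefault: it returns nil on an empty array rather than raising. The snippet below checks that and mimics the shape of the quoted branch. The pick method and the :run_classifier placeholder are my own stand-ins for the elided code, not what Linguist actually does.

```ruby
empty = []

p empty.first      # Array#first on an empty array returns nil, no error
p empty.first(1)   # with a count argument it returns an empty array

# Array#fetch is the variant that raises, closer to LINQ's First.
begin
  empty.fetch(0)
rescue IndexError => e
  puts "fetch raised #{e.class}"
end

# Hypothetical stand-in with the same shape as the quoted branch.
def pick(possible_languages)
  if possible_languages.length > 1
    :run_classifier            # placeholder for the elided "..." part
  else
    possible_languages.first   # nil when the array is empty
  end
end

p pick([])
p pick(["Ruby"])
p pick(["Objective-C", "Mercury"])
```

So the zero-length case wouldn't crash at that line; it would quietly yield nil, and whether that's a bug depends on how the caller handles nil.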

This doesn't instill a lot of confidence that someone really thought this bit of code through. I'm not saying I think very strictly through everything I write, but I also don't write software for thousands of users, and I acknowledge that I've grown rather complacent in terms of time spent per unit of code.

I'm looking at more of this, and jesus christ, this is completely wrong: https://github.com/github/linguist/blob/master/lib/linguist/...

Code that commonly resides in .asp files is completely different from code that commonly resides in .aspx files. They are not synonyms for each other. Also, I would wager that C# aspx files are a tad more common than VB.NET aspx files.

It's even worse than lumping together .c and .cpp. You at least have some chance of getting .c files to compile in a C++ compiler. There is no chance of running ASP code through the ASP.NET engine.

This is why "ASPX" exists as a term: to differentiate it from ASP.