I also think filtering out comments would improve it - especially because so many source files include a copyright statement at the top, and the same licenses (MIT, GPL, Apache, etc) are found repeated in many different files and it distorts the results somewhat.
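As a rough illustration of what "filtering out comments" could look like for C-style languages, here is a minimal Python sketch. The function names (`strip_comments`, `word_counts`) and the sample snippet are made up for this example, and the regexes are deliberately naive: they do not handle `//` or `/*` appearing inside string literals, so treat it as a sketch rather than a real tokenizer.

```python
import re
from collections import Counter

# Naive patterns for C-style comments. Assumption: comment markers never
# appear inside string literals, which is not true of real-world code.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
LINE_COMMENT = re.compile(r"//[^\n]*")  # also matches /// doc comments
WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def strip_comments(source: str) -> str:
    """Remove /* ... */ blocks first, then // line comments."""
    without_blocks = BLOCK_COMMENT.sub(" ", source)
    return LINE_COMMENT.sub("", without_blocks)


def word_counts(source: str, keep_comments: bool = True) -> Counter:
    """Count identifier-like words, optionally ignoring comments."""
    text = source if keep_comments else strip_comments(source)
    return Counter(WORD.findall(text))


# Hypothetical sample: a license header, a doc comment, and one method.
SAMPLE = """\
/* Licensed under the MIT License. */
/// <summary>Adds two numbers.</summary>
int Add(int a, int b) { return a + b; } // trivial
"""

with_comments = word_counts(SAMPLE)
code_only = word_counts(SAMPLE, keep_comments=False)
```

Comparing `with_comments` against `code_only` shows how much of the frequency table comes from license text and doc markup rather than from the code itself.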
In that particular case, I'd say it's quite interesting indeed!
Presumably "summary" appears quite a lot because C# developers use markup like `<summary>` in their comments, so automated systems can build documentation (I've never used a Microsoft programming language, but a quick search brought me to https://msdn.microsoft.com/en-us/library/z04awywx.aspx).
In that sense, it's not really a comment anymore: it's one machine-readable language embedded inside another.
That's certainly interesting, to me at least. It tells me about the signal/noise ratio of the language, the prevalence of various forms of documentation (e.g. <summary> is conventional, whilst something like <precondition> is not), etc.
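To make the "prevalence of various forms of documentation" idea concrete, here is a small Python sketch that tallies XML tags appearing in `///` doc comments. The function name `doc_tag_counts` and the sample source are invented for illustration, and it assumes doc comments always use the `///` line form (not `/** ... */`).

```python
import re
from collections import Counter

# Lines that start with ///, capturing the rest of the line.
DOC_LINE = re.compile(r"^[ \t]*///(.*)$", re.MULTILINE)
# Opening or self-closing tags only; closing tags like </summary> are skipped.
OPEN_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^>]*?/?>")


def doc_tag_counts(source: str) -> Counter:
    """Count how often each XML tag is opened inside /// doc comments."""
    counts = Counter()
    for doc_text in DOC_LINE.findall(source):
        counts.update(OPEN_TAG.findall(doc_text))
    return counts


# Hypothetical C# fragment used only to exercise the function.
SAMPLE = """\
/// <summary>
/// Adds two numbers.
/// </summary>
/// <param name="a">First.</param>
/// <param name="b">Second.</param>
/// <returns>The sum.</returns>
int Add(int a, int b) { return a + b; }

/// <summary>Sums a list.</summary>
int Sum(List<int> xs) { return 0; }
"""

tags = doc_tag_counts(SAMPLE)
```

Generic type arguments such as `List<int>` on ordinary code lines are ignored, because only text following `///` is scanned.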
Such terms clearly have an effect on a system's documentation, even if they don't have an effect on the CPU instructions being executed. But I'm a programmer, not a CPU; text files containing source code are my main I/O interface, and they most certainly do contain such markup, so I find statistics about it interesting. In comparison, I don't step through very much assembly day to day, so I don't really care very much about the compiler output (the part which the comments don't affect). I prefer to reason at the level of the language I'm using, where not only do comments appear, they're very useful!
> Presumably "summary" appears quite a lot because C# developers use markup like `<summary>` in their comments, so automated systems can build documentation
Yes, and the IDE will auto-generate a doc comment with a <summary> because that's pretty much the most basic doc comment you can get.
> In that sense, it's not really a comment anymore: it's one machine-readable language embedded inside another.
My issue is not that it's a comment, it's that it is essentially worthless as your IDE's basic "add method" intention (or whatever) is going to add it automatically.
> That's certainly interesting, to me at least. It tells me about the signal/noise ratio of the language, the prevalence of various forms of documentation (e.g. <summary> is conventional, whilst something like <precondition> is not), etc.
<summary> is not conventional, it's the primary tag used by the C# documentation system and shown by IntelliSense. <precondition> is not that.
> it is essentially worthless as your IDE's basic "add method" intention (or whatever) is going to add it automatically
Just because an IDE will write boilerplate automatically, that doesn't mean the boilerplate wasn't written, checked into version control, presented to developers, etc. Even if such boilerplate were added by an IDE, and hidden from developers (e.g. using code folding), it's still there in the language.
In this case, the language is C#, not e.g. some "C#-like" language which gets preprocessed/transpiled by an IDE into C# by scattering boilerplate around.
Whilst tooling can help us live with a language's deficiencies, it doesn't remove those deficiencies ;)
Well, there is a sense in which the language you write (which may not be the language you read) is defined by how you interact with the development environment to produce code.
Which is why I prefer a language where I just need to learn one language, and not a separate input language that has to be defined for writing code productively because the language-as-read is too unergonomic to write.