If that was a serious question, places that do developer tools and programming languages for a living (at least MSFT, I assume others as well) pay ridiculous amounts of money to independent third-party companies that specialize in gathering this kind of data. The raw data was then kept fairly private (marketing + upper mgmt only), but the rank and file would see some of it occasionally when things such as trends on the number of VBA or VB5 or VC++ programmers appeared in slide decks talking about the direction for upcoming versions of the product.
Having working with the raw data, it was pretty fantastic. Segemented by industry/business size, handled issues with multiple programming languages or companies where one section used one language and another used other ones, etc. We even knew which tools and add-ons were used for which languages and which compiler on each platform (i.e. how many commercial shops using C++ targeting linux are using gcc vs. icc?).
But that data was also stunningly expensive. My marketing friends tell me that accurate market data always is.
Interesting. It's a tautology but the Internet sees only the Internet - there's a huge swathe of programming work that just isn't advertised online, so is invisible to TIOBE.
2) I don't go around making statements about how, from one month to the next, someone displaced someone else, or climbed into a certain ranking, or things like that.
3) I let people reweight the chart based on the metrics they like.
I think the numbers LangPop comes up with are pretty good, but by their very nature are a bit fuzzy.
GitHub and StackOverflow started out really biased in terms of their communities - GitHub with Ruby, and StackOverflow with Microsoft languages. Do you think they've sufficiently lost that bias?
On another note, you know what else could be a great source(s) for data? Google Scholar, CiteSeerX and arXiv. It'd be really interesting to compare the language usage between "industry" and academia.
I think StackOverflow has; GitHub still seems a bit biased towards scripting languages like Ruby or Javascript though. In any case, since you show the graphs from the various sources, I don't think it matters if one source is more biased than another. If, for example, you used CodePlex as a data source, you'll see a huge bias towards C# -- but visitors to your site could simply draw their own conclusions from the charts.
I'm not sure I even trust that... I do lots of bindings from OCaml to C, and whereas I consider them to be OCaml projects, GitHub sees they're more C by LOC and counts them as C.
If nearly all ruby programmers put their code on github, and you don't count it but you do count google code, then isn't that a bias i the sample? Wouldn't it make more sense to include all the major code hosting sites, including github?
Having working with the raw data, it was pretty fantastic. Segemented by industry/business size, handled issues with multiple programming languages or companies where one section used one language and another used other ones, etc. We even knew which tools and add-ons were used for which languages and which compiler on each platform (i.e. how many commercial shops using C++ targeting linux are using gcc vs. icc?).
But that data was also stunningly expensive. My marketing friends tell me that accurate market data always is.