Hacker News
by canjobear 1646 days ago
Google Books is not a reliable sample of language over time, because its genre mix shifts in ways that are correlated with factors such as university library policies and interactions of OCR with typesetting practices.

Also note this is a “contributed by” paper which means it didn’t go through the usual PNAS review process. (Presumably the authors didn’t think it would make it through.)

2 comments

Instead of arguing about why the data sources they analyzed are limited or not fully representative, I suggest we take an interest in their findings and look further into the matter. The authors are open about the limitations and possible biases in the data sources, and about how modern language use could affect the interpretation. I think they've described well how they took this into account in their study.
Even if the data source is biased, it would be interesting to know _how_ it’s biased.

After all, it’s the data that is widely available to people on the internet.

In this particular case, it's biased because Google Books includes much more fiction and far fewer scholarly works after about the year 2000. Link to a response to a previous article by this same group: https://www.pnas.org/content/118/45/e2115010118.short