| HN Mirror



	by canjobear 1646 days ago
	Google Books is not a reliable sample of language over time due to shifts in genre representation that are correlated with all kinds of things like university library policies, interactions of OCR with typesetting practices, etc. Also note this is a “contributed by” paper which means it didn’t go through the usual PNAS review process. (Presumably the authors didn’t think it would make it through.)

2 comments

Manheim 1645 days ago

Instead of arguing about why the data sources they analyzed are limited or not fully representative, I suggest we take an interest in their findings and look further into the matter. The authors are open about the limitations and possible biases in the data sources, and about how modern use of language could affect the interpretations. I think they've provided good descriptions of how they took this into account in their study.


hcarvalhoalves 1645 days ago

Even if the data source is biased, it would be interesting to know _how_ it’s biased.

After all, it’s the data that is widely available to people on the internet.


lillabullero 1644 days ago

In this particular case, it's biased because Google Books includes much more fiction and many fewer scholarly works after about the year 2000. Link to a response to a previous article by this same group: https://www.pnas.org/content/118/45/e2115010118.short
