Hacker News
by _delirium 5669 days ago
This is just one example of why people need to be careful when using this data. It's a very useful source, but I've been seeing a lot of uncritical uses cropping up. For example, some people are using it to track intellectual trends, e.g. by comparing the graphs of Heidegger and Russell. This can sometimes work, but it depends heavily on: 1) the uniqueness of the names; and 2) the particular set of books included in Google's corpus (especially when comparing people who aren't from exactly the same area, like a scientist versus an artist).

Even with relatively unique names, it can be tricky. Completely or almost completely unique last names (like "Nietzsche") are the easy case, but with the available interface to the data, it's difficult to handle names where First+Last is unique but Last alone isn't. You need to count things like "First Last" and "Last, First", plus variants like "First M. Last", without double-counting.
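
To make the double-counting problem concrete, here is a rough sketch of how I'd count the variants if I had raw text to work with, which is an assumption: the interface discussed above doesn't give you that. The function name `count_name_mentions` and the sample sentences are made up for illustration. The idea is to put all the variants into one regex alternation, so that a single left-to-right pass counts each stretch of text at most once.

```python
import re


def count_name_mentions(text: str, first: str, last: str) -> int:
    """Count mentions of a person, treating name variants as one name.

    Hypothetical helper: assumes access to raw text, and only handles
    "First Last", "First M. Last" and "Last, First".
    """
    f, l = re.escape(first), re.escape(last)
    pattern = re.compile(
        rf"\b(?:{f}\s+(?:[A-Z]\.\s+)?{l}"  # "First Last" or "First M. Last"
        rf"|{l},\s+{f})\b"                 # "Last, First"
    )
    # finditer yields non-overlapping matches, so text consumed by one
    # variant cannot be counted again by another.
    return sum(1 for _ in pattern.finditer(text))


sample = (
    "Bertrand Russell wrote this. See Russell, Bertrand. "
    "Bertrand A. Russell and Jane Russell are different people."
)
print(count_name_mentions(sample, "Bertrand", "Russell"))  # 3
```

"Jane Russell" is ignored because the bare last name is never counted on its own. Counting each variant separately and adding the totals would count a string like "Russell, Bertrand Russell" twice, whereas the single pass counts it once. It still misses plenty (initials only, titles, line breaks inside a name), so treat it as a starting point and not as a real solution.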