
In the case of spelling correction and query expansion, every little bit helps. Suppose you want to learn that people typing [mazad] mean [mazda]. (This is kind of a silly example, as dictionary- and edit-distance-based techniques can do corrections like this. So bear with me.) The event you need to catch is:

1. User mistakenly types a query [mazad], meaning [mazda]. (Probably less than 1% of total queries for Mazda, which is an infinitesimally tiny fraction of the total queries in your system.)

2. The user gets garbage results, realizes their mistake, and fixes it rather than giving up in frustration. This is probably rather rare too.

3. The user clicks through to something that ranked highly for [mazda], and stays there long enough that your system counts it as a "long click" that probably satisfied the user.
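To make the three steps concrete, here is a rough sketch of how you might mine them out of session logs. Everything about the log format is my own assumption for illustration: a session is a list of `("query", text)` and `("click", dwell_seconds)` events, "garbage results" is approximated as "no click at all on the first query", and the 30-second long-click threshold is a made-up number. I have no idea how any real search engine actually does this.

```python
from collections import Counter

LONG_CLICK_SECONDS = 30  # assumed threshold; a real system would tune this


def mine_rewrites(sessions, long_click_seconds=LONG_CLICK_SECONDS):
    """Count (first_query, rewritten_query) pairs that end in a long click."""
    counts = Counter()
    for events in sessions:
        # Group each query with the dwell times of the clicks that followed it.
        turns = []
        for kind, value in events:
            if kind == "query":
                turns.append((value, []))
            elif kind == "click" and turns:
                turns[-1][1].append(value)
        # Look at each pair of consecutive queries in the session.
        for (q1, clicks1), (q2, clicks2) in zip(turns, turns[1:]):
            abandoned = not clicks1  # step 2 (approximated): nothing worth clicking
            satisfied = any(d >= long_click_seconds for d in clicks2)  # step 3
            if q1 != q2 and abandoned and satisfied:
                counts[(q1, q2)] += 1
    return counts


# Hypothetical toy log: only the first session is a "golden" event.
sessions = [
    [("query", "mazad"), ("query", "mazda"), ("click", 95)],  # fixed, long click
    [("query", "mazad"), ("query", "mazda"), ("click", 4)],   # fixed, short click
    [("query", "mazad")],                                     # gave up
    [("query", "mazda"), ("click", 120)],                     # never misspelled
]
rewrites = mine_rewrites(sessions)
print(rewrites)
```

Note that three of the four toy sessions contribute nothing, which is the whole problem: most sessions are not the event you are looking for.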

The golden datum here is literally a one in very-many-thousands-of-sessions event, and you need to catch a statistically meaningful number of them for every misspelling (or synonym, or whatever you're trying to learn from this data) you'd like to have your system learn. To have good coverage of the English language, we're talking about many billions of search sessions.
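If you want to play with the arithmetic, here is a back-of-envelope sketch. It just multiplies the per-step probabilities together, which assumes the steps are independent (they probably aren't). Apart from the 1% typo rate, which is the upper bound I guessed above, every number below is invented purely for illustration, so plug in your own.

```python
def sessions_needed(p_topic, p_typo, p_fix, p_long_click, observations):
    """Expected sessions needed to observe `observations` golden events.

    Treats the steps as independent, which is a simplifying assumption.
    """
    p_event = p_topic * p_typo * p_fix * p_long_click
    return observations / p_event


estimate = sessions_needed(
    p_topic=1e-3,      # invented: share of sessions about this topic
    p_typo=0.01,       # the "less than 1%" upper bound from step 1
    p_fix=0.1,         # invented: user fixes the typo instead of giving up
    p_long_click=0.5,  # invented: the corrected query ends in a long click
    observations=100,  # invented: what counts as "statistically meaningful"
)
print(f"{estimate:.3g} sessions for one misspelling")
```

That is the cost of learning a single correction; multiply by however many misspellings and synonyms you want to cover.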

A previous commenter pointed out that Yahoo! probably has enough data; I bet they're right. I don't know if Yahoo! and Bing's technology partnership included access to such data.