by metamatt
5609 days ago
What volume do you really need, to get enough data to learn from? I'd think that 1% of Google traffic would still be a pretty big firehose to feed whatever learning algorithm you need to feed. Don't Google, Facebook, et al run a lot of experiments for new projects on a subset of users/queries that's far smaller than 1% of traffic, and still yields very useful results?

1. A user mistakenly types the query [mazad], meaning [mazda]. (This is probably less than 1% of all queries for Mazda, which are themselves an infinitesimally tiny fraction of the total queries in your system.)
2. The user gets garbage results, realizes their mistake, and fixes it rather than giving up in frustration. This is probably rather rare too.
3. The user clicks through something that ranked highly for Mazda, and stays there long enough that your system thinks it is a "long click" that probably satisfied the user.
The golden datum here is literally a one in very-many-thousands-of-sessions event, and you need to catch a statistically meaningful number of them for every misspelling (or synonym, or whatever you're trying to learn from this data) you'd like to have your system learn. To have good coverage of the English language, we're talking about many billions of search sessions.
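To put rough numbers on that funnel, here is a back-of-the-envelope sketch. Every rate in it is a made-up placeholder for illustration, not real search data, and the function names are hypothetical. It just multiplies the three step probabilities together and then scales by how rare the target query is overall.

```python
def golden_rate(p_typo, p_fix, p_long_click):
    """Chance that one session for the target query yields a usable example.

    All three steps must happen: the user mistypes, then corrects the query
    instead of giving up, then makes a satisfying "long click".
    """
    return p_typo * p_fix * p_long_click


def sessions_needed(rate, query_share, examples_needed):
    """Expected total sessions required to collect `examples_needed` examples.

    `query_share` is the fraction of all sessions that are about the target
    query at all (e.g. Mazda searches out of everything).
    """
    return examples_needed / (rate * query_share)


# Placeholder assumptions, chosen only to show the arithmetic.
rate = golden_rate(p_typo=0.01, p_fix=0.1, p_long_click=0.5)
total = sessions_needed(rate, query_share=1e-4, examples_needed=100)

print(f"golden events per target-query session: {rate:.6f}")
print(f"total sessions needed for one misspelling: {total:,.0f}")
```

With guesses like these, a single misspelling already needs on the order of billions of sessions, and that is before you multiply by every misspelling or synonym you want covered.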
A previous commenter pointed out that Yahoo! probably has enough data; I bet they're right. I don't know if Yahoo! and Bing's technology partnership included access to such data.