by agentS 4501 days ago
One of the reasons that spelling corrections are so good on Google is (probably) that they are machine learning models trained on query logs (1). For example, if you search for "hacer news" and then, without clicking on any results, issue another query for "hacker news" within a very short timeframe, the model will learn that "hacker" is a good suggestion for "hacer" (2).

The same applies to "Mark as Spam", Priority Inbox, Recommended Videos on YouTube, Voice Recognition on Android, etc.
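
To make the query-log idea concrete, here is a minimal sketch of how such pairs could be mined. It is pure guesswork about the general approach: the log layout (session id, timestamp, query, clicked flag), the function name mine_corrections, and the 10-second window are all assumptions made up for illustration.

```python
from collections import Counter

def mine_corrections(log, max_gap_seconds=10):
    """Count (original, rewritten) query pairs that look like self-corrections.

    `log` is an iterable of (session_id, timestamp_seconds, query, clicked)
    tuples. This layout is hypothetical, not any real log format.
    """
    counts = Counter()
    last_by_session = {}
    # Process each session's queries in chronological order.
    for session_id, ts, query, clicked in sorted(log, key=lambda e: (e[0], e[1])):
        prev = last_by_session.get(session_id)
        if prev is not None:
            prev_ts, prev_query, prev_clicked = prev
            # Signal: no click on the first query, then a different query
            # issued shortly afterwards in the same session.
            if (not prev_clicked
                    and query != prev_query
                    and ts - prev_ts <= max_gap_seconds):
                counts[(prev_query, query)] += 1
        last_by_session[session_id] = (ts, query, clicked)
    return counts
```

Pairs that show up across many independent sessions are the ones worth trusting; a single occurrence could just be someone changing their mind.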

Note 1: Yes, you could also do a pretty good job by having a model of your problem, e.g. computing a weighted Levenshtein distance where the weights are the probabilities of making each error. However, I'd argue that this would still be better with centralized data; you can compute much better probability vectors. And regardless, the best solutions in the field will combine both.
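
A rough sketch of the weighted Levenshtein distance from Note 1 might look like the following. Turning each probability into a cost via a negative log is one common choice, not the only one; the function name, the (typed, intended) key layout of sub_prob, and the default probabilities are assumptions for illustration.

```python
import math

def weighted_levenshtein(typed, intended, sub_prob,
                         default_sub_prob=0.001, indel_prob=0.001):
    """Edit distance where likely errors are cheap and unlikely ones are costly.

    `sub_prob` maps (typed_char, intended_char) to the estimated probability
    of that substitution error. Costs are negative log probabilities.
    """
    indel_cost = -math.log(indel_prob)
    rows, cols = len(typed) + 1, len(intended) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    # Turning a prefix into the empty string (or back) takes only indels.
    for i in range(1, rows):
        dist[i][0] = i * indel_cost
    for j in range(1, cols):
        dist[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = typed[i - 1], intended[j - 1]
            if a == b:
                sub_cost = 0.0
            else:
                sub_cost = -math.log(sub_prob.get((a, b), default_sub_prob))
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost,    # extra character typed
                dist[i][j - 1] + indel_cost,    # character left out
                dist[i - 1][j - 1] + sub_cost,  # match or substitution
            )
    return dist[-1][-1]
```

With this, weighted_levenshtein("hacer", "hacker", {}) comes out as the cost of a single omitted character. The centralized-data argument is about where sub_prob comes from: the more typing data you see, the better those estimates get.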

Note 2: All of the above is speculation. While I help write some of the tools that these guys use, I have no knowledge of how they write their software. This is just how I'd do it.


I completely agree with the examples you cite. However, the parent was giving examples about auto-correction for a user's contacts. That is user-specific data and there's no need for it to be shared with a third party.

Ah, I understand. I interpreted that comment as a list of 3 separate examples where centralization helps.

A nitpick:

Auto-correction for a user's contacts could probably be done on-device, although I'd guess that machine learning across all users will probably massively improve your success rate. Consider an ambiguous correction: you accidentally type "Gob", but have contacts named "Rob" and "Bob". I imagine that ranking the suggestions can be improved using a globally trained model.
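
As a toy illustration of the "Gob" case, the contact list can stay on the device while only the per-key error probabilities come from a global model. The function name rank_contacts and the probabilities in the usage note are invented for this sketch (though G and B are at least adjacent on a QWERTY keyboard, while G and R are not), and it only handles same-length substitutions.

```python
def rank_contacts(typed, contacts, sub_prob, default_prob=0.001):
    """Order local contacts by how likely `typed` is a mistyping of each.

    `sub_prob` maps (typed_char, intended_char) to a substitution
    probability and stands in for a globally trained error model.
    """
    def likelihood(candidate):
        # Simplification: only consider same-length candidates.
        if len(candidate) != len(typed):
            return 0.0
        p = 1.0
        for t, c in zip(typed.lower(), candidate.lower()):
            if t != c:
                p *= sub_prob.get((t, c), default_prob)
        return p

    return sorted(contacts, key=likelihood, reverse=True)
```

For example, with made-up probabilities {("g", "b"): 0.05, ("g", "r"): 0.002}, rank_contacts("Gob", ["Rob", "Bob"], ...) puts "Bob" first. Without the global table both candidates are one substitution away and the tie cannot be broken.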