
by ynn4k 5608 days ago

A general problem with n-gram models is the trade-off between data sparseness and reliability of estimation. To capture enough context you need a larger order n, but a larger n also makes the model bigger and the counts sparser, so it takes more data and storage to estimate it reliably. Thanks to the Web as a corpus and cloud computing, models up to 5-grams can now be computed on terabytes of data, provided you are resourceful. One problem with this approach is selecting the web data to train on: the better the training data is adapted to the target scenario, the better the accuracy.
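To make the sparseness point concrete, here is a minimal Python sketch that counts n-grams for increasing n and reports how many distinct n-grams there are and what share of them were seen only once. The function names (`ngram_counts`, `sparsity_report`) and the toy corpus are made up for illustration; this is not how any particular service does it.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sparsity_report(tokens, max_n=5):
    """For each order 1..max_n, return (n, distinct n-grams, share seen once)."""
    report = []
    for n in range(1, max_n + 1):
        counts = ngram_counts(tokens, n)
        if not counts:
            break  # the text is shorter than n tokens
        singletons = sum(1 for c in counts.values() if c == 1)
        report.append((n, len(counts), singletons / len(counts)))
    return report

# Toy corpus, invented for this example only.
text = "the cat sat on the mat and the cat ate the fish on the mat"
for n, distinct, singleton_share in sparsity_report(text.split()):
    print(f"n={n}: {distinct} distinct, {singleton_share:.0%} seen only once")
```

On a corpus this small nearly every higher-order n-gram occurs once, so its count says little about its real probability. That is the reason larger n needs far more training data.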

  i see no services that make use of this.

Most services have proprietary spell-correction implementations that are an amalgamation of several techniques, including n-grams, and they might not want to make them public.