by creichenbach
3006 days ago
Our problem is not about auto-completion (we're not dealing with enough data to need sophisticated algorithms for that). What we're doing with our NN is ordering the set of results (matches) we already have. In other words, we're assigning a relevance number in [0, 1] to each result, based on the query string, with the NN trained on past user choices (clicks on a result). To maintain some consistency and robustness, we need our NN to yield similar results for similar word fragments. So if the NN has previously learned meaningful result priorities for "cargo", they should ideally also work for "carg" (and vice versa), because our tool updates the listing live as the user types.
|
I think the best way to do this is to create a second neural network which smooths out fragments, mapping each one either to the word2vec vector of the word it was derived from or to the derived word itself. In both approaches you start by making a dataset where each word in the vocabulary is the output for multiple artificially generated, incorrectly spelled inputs. For example, you want the inputs "crg", "carg", "argo", "crgo", "cago", "cargo", "cargop", "cartgo" to all map to "cargo" in this data, whether the target is the string "cargo" itself or its word2vec embedding. The approach where word2vec embeddings are the output allows a fragment like "carg" to be interpreted as something in between "car" and "cargo", both as input to your main NN and for training purposes, which might be what you want. There's some info on this here [0], but they use it to regenerate the words themselves, which you probably don't want. Note that including the identity pairs ("cargo" -> "cargo") and getting low training error on them is very important unless you do a preliminary vocabulary check, since otherwise correctly spelled inputs can come out distorted.
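To make the dataset step concrete, here is a rough sketch of how the (noisy input, target word) pairs could be generated. The function names `noisy_variants` and `build_pairs`, the choice of single-character edits, and the default of 8 variants per word are all my own assumptions for illustration; you would tune the noise model to the kinds of fragments your users actually type.

```python
import random
import string

LETTERS = string.ascii_lowercase

def noisy_variants(word, n=8, seed=0):
    """Return the word itself plus up to n distinct single-edit corruptions.

    Hypothetical helper: assumes a non-empty, lowercase ASCII word.
    """
    rng = random.Random(seed)
    variants = {word}  # identity pair, so correct spellings stay fixed
    attempts = 0
    while len(variants) < n + 1 and attempts < 100 * n:
        attempts += 1
        op = rng.choice(("delete", "insert", "replace", "transpose"))
        if op == "insert":
            i = rng.randrange(len(word) + 1)  # may also append at the end
            cand = word[:i] + rng.choice(LETTERS) + word[i:]
        else:
            i = rng.randrange(len(word))
            if op == "delete" and len(word) > 1:
                cand = word[:i] + word[i + 1:]
            elif op == "replace":
                cand = word[:i] + rng.choice(LETTERS) + word[i + 1:]
            elif op == "transpose" and i < len(word) - 1:
                cand = word[:i] + word[i + 1] + word[i] + word[i + 2:]
            else:
                continue  # edit not applicable at this position
        variants.add(cand)
    return sorted(variants)

def build_pairs(vocabulary, n=8, seed=0):
    """Build (noisy_input, target_word) training pairs for a vocabulary."""
    return [
        (variant, word)
        for k, word in enumerate(vocabulary)
        for variant in noisy_variants(word, n, seed + k)
    ]

pairs = build_pairs(["cargo", "car"])
```

For the embedding variant you would replace the target string in each pair with that word's word2vec vector before training.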
The second approach, generating correct spellings instead of approximate vectors, fails if the network's output isn't close enough to a real word, although it seems that if the Levenshtein distance is <= 2, the approximation can be corrected cheaply [1]. Sorry I couldn't be of more help, I haven't really encountered this type of problem before. Good luck, you have an interesting problem to solve!
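In case it helps, this is roughly what the cheap correction step looks like, written in the style of the approach in [1]. It is a sketch, not a copy of that code: the `counts` dictionary (word -> how often it was seen) is a hypothetical stand-in for whatever frequency data you have, and the example counts are made up.

```python
import string

LETTERS = string.ascii_lowercase

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right
                for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Return the most frequent known word within 2 edits, else word itself."""
    def known(words):
        return {w for w in words if w in counts}

    candidates = (
        known([word])
        or known(edits1(word))
        or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
        or {word}
    )
    return max(candidates, key=lambda w: counts.get(w, 0))

# Hypothetical vocabulary with made-up frequencies.
counts = {"cargo": 10, "car": 50, "cart": 5}
fixed = correct("cartgo", counts)
```

Note that with these counts "carg" is within one edit of "cargo", "car", and "cart", so the frequency tie-break decides the result; that ambiguity is exactly where the embedding approach may serve you better.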
[0] https://machinelearnings.co/deep-spelling-9ffef96a24f6
[1] http://norvig.com/spell-correct.html