Hacker News new | ask | show | jobs
by creichenbach 3006 days ago
Our problem is not about auto-completion (we're not dealing with that much data to need sophisticated algorithms for that). What we're doing with our NN is ordering the set of results (matches) we already have. In other words, we're assigning a relevance number in [0, 1] to each result, based on the query string and training based on past user choices (clicking a result). In order to maintain some consistency and robustness, we need our NN to yield similar results for similar word fragments.

So if the NN has previously learned meaningful result priorities for "cargo", they should ideally also work out for "carg" (and vice versa) because of the live listing nature of our tool.

3 comments

Ah ok I think I get it. So herein lies the problem (let me know if any of this is incorrect): you want to encode fragments of a word similarly to completed versions of the word, without doing inference. The easiest way of augmenting your training data into a word2vec embedding with artificially misspelled inputs doesn't seem like it would work considering how many misspellings there are per word.

I think the best way to do this is to create a second neural network which smooths out fragments into word2vec vectors corresponding to the derived word (or the derived word itself). In both approaches you start by making a dataset where each word in the vocabulary is the output for multiple incorrectly spelled, artificially generated inputs. For example you want to have the inputs "crg", "carg", "argo", "crgo", "cago", "cargo", "cargop" "cartgo" all have outputs to "cargo" in this data, whether it's the string "cargo" itself or the w2vec embedding of it. The approach where w2vec embeddings are the output allows for words like "carg" to be interpreted as something like a median between "car" and "cargo" both as input to your main NN and for training purposes, which might be want you want. There's some info on this here [0] but they use it to regenerate words themselves, which you probably don't want. Note that including the identity/low training error is very important unless you do a preliminary vocabulary check.

The second approach of generating correct spellings instead of approximate vectors fails if it doesn't get a close enough approximation, although it seems if levenstein distance <=2, the approximation can be corrected cheaply [1]. Sorry I couldn't be more of help, I haven't really encountered this type of problem before. Good luck, you have an interesting problem to solve!

[0] https://machinelearnings.co/deep-spelling-9ffef96a24f6 [1] http://norvig.com/spell-correct.html

Ah, so your first suggestion would be basically building an auto-encoder based on training data with correct and incorrect words (fragments). This might work, but it would require a lot of computation: Each word of the vocabulary multiplied by all its similar counterparts. And this wouldn't cover new/unknown terms yet.

What we ended up doing for now is a two-dimensional input layer with per-column one-hot encoding of characters (i.e. one character is one column, 128 rows for the ascii alphabet). Then, apply a convolution with kernel dimensions 3x128, which flattens data to one dimesion and combines three neighboring characters. The second part builds an "assiciation" between neighbors, which helps yielding similar outputs for similar word fragments.

This works quite well, except for some nasty limitations:

- Search queries have a hard limit in length, caused by our input layer dimensions

- Due to varying search query length, input nodes on the right side are often unused/zero, leading an weighting bias on the left side when training. That is, the start of search queries receives more attention that the end. But that's not necessarily a bad thing.

creichenbach: I believe I have an algorithm with working code, for you, but don't want to spam this thread. send me an email if you want to. (on my profile).

[Also. I hate having to spam discussion threads with personal user-to-user comments.. but there's no message user feature. This message will self destruct once it's goal has been achieved.]

You can use a character RNN that reads the fragmented input char by char, and feed its output to your NN.