by mebassett
907 days ago
This is old tech, but it had me thinking. Markov chains pick the next token from a random set, and they give approximately all possible tokens (from the training set) an equal probability. What if it weighted the probability instead, say using the inverse vector distance or cosine similarity of the neighbors as a proxy for probability, where the vector embeddings come from word2vec? How close would the performance be to a transformer model, or even to something like LSTM RNNs? (I suppose I'm cheating a bit by using word2vec here. I might as well just say I'm using the attention mechanism...)
|
If I understand correctly, what you're proposing is to replace co-occurrence frequency with word2vec cosine similarity.
I suppose it may help improve overall performance, but you're still just blindly predicting the next word from the previous one, exactly as a first-order Markov chain would.
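To make that concrete, here is a rough sketch of the two weighting schemes side by side. The corpus and the 2-d vectors are made up for illustration (they are not real word2vec embeddings), and clipping negative similarities to zero before normalising is just one possible choice. Note that both functions only ever look at `prev`.

```python
import math
from collections import Counter, defaultdict

def bigram_probs(tokens):
    """Count-based first-order chain: P(next | prev) from bigram frequencies."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {w: c / sum(ctr.values()) for w, c in ctr.items()}
        for prev, ctr in counts.items()
    }

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_probs(prev, candidates, vectors):
    """Weight each candidate by cosine similarity to `prev` (assumed scheme)."""
    weights = {w: max(cosine(vectors[prev], vectors[w]), 0.0) for w in candidates}
    total = sum(weights.values())
    if total == 0:
        # Fall back to uniform if nothing is positively similar.
        return {w: 1 / len(weights) for w in weights}
    return {w: s / total for w, s in weights.items()}

# Toy data: hypothetical corpus and hand-picked vectors, not trained embeddings.
tokens = "the cat sat on the mat".split()
vectors = {"the": (1.0, 0.2), "cat": (0.9, 0.4), "mat": (0.3, 1.0)}

count_based = bigram_probs(tokens)["the"]
sim_based = similarity_probs("the", list(count_based), vectors)
print(count_based)
print(sim_based)
```

The numbers change (the similarity version favours "cat" with these toy vectors), but the conditioning does not: everything before `prev` is ignored in both cases.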
For example, it will never reliably learn "2 plus 2 equals 4," because by the time we reach "equals" we have discarded all the previous words.
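A small sketch of that failure, using three made-up training sentences. The `train` and `next_probs` helpers are hypothetical names for this example only. The order-4 chain is included purely to show what keeping the context buys you; it memorises the training sentences and would not generalise to unseen sums.

```python
from collections import Counter, defaultdict

# Hypothetical toy training set.
sentences = ["1 plus 1 equals 2", "2 plus 2 equals 4", "3 plus 3 equals 6"]

def train(sentences, order):
    """Count next-token frequencies for each context of `order` tokens."""
    counts = defaultdict(Counter)
    for s in sentences:
        toks = s.split()
        for i in range(order, len(toks)):
            counts[tuple(toks[i - order:i])][toks[i]] += 1
    return counts

def next_probs(model, context, order):
    """Normalise the counts for the last `order` tokens of `context`."""
    ctr = model.get(tuple(context[-order:]), Counter())
    total = sum(ctr.values())
    return {w: c / total for w, c in ctr.items()} if total else {}

prompt = "2 plus 2 equals".split()

first_order = next_probs(train(sentences, 1), prompt, 1)
fourth_order = next_probs(train(sentences, 4), prompt, 4)

print(first_order)   # "2", "4" and "6" are equally likely after "equals"
print(fourth_order)  # only "4" remains once the whole prompt is the context
```

With only "equals" as the state, the first-order chain splits its probability evenly across every answer it has seen, whichever numbers came before.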
Perhaps it would work if we could get the embedding model to consider the full sentence and then produce a set of probability-scored next-token predictions, but at that point we've just reinvented a transformer.