Hacker News
by mebassett 907 days ago
This is old tech, but it had me thinking. Markov chains pick the next token from a random set, and they give approximately all possible tokens (from the training set) an equal probability. What if they weighted the probability instead, say using the inverse vector distance or cosine similarity of the neighbors as a proxy for probability, where the vector embeddings come from word2vec... how close would the performance be to a transformer model, or even something like LSTM RNNs? (I suppose I'm cheating a bit using word2vec here. I might as well just say I'm using the attention mechanism...)
2 comments

That sounds interesting, but still fails to capture long-range dependencies.

If I understand correctly, what you're proposing is to replace co-occurrence frequency with word2vec cosine similarity.
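
To make sure we're talking about the same thing, here's a rough sketch of how I read the proposal. The 2-d vectors are made-up stand-ins for real word2vec embeddings, and `next_token_probs` is a hypothetical helper, not anything from an actual library. Clipping negative similarities to zero is my own assumption so the weights can be normalized.

```python
import math

# Toy 2-d vectors standing in for word2vec embeddings (made-up values).
EMB = {
    "cat": (1.0, 0.0),
    "dog": (0.8, 0.6),
    "car": (0.6, 0.8),
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def next_token_probs(current, candidates):
    """Turn cosine similarity to the current token into a probability distribution."""
    # Clip negatives to zero so the weights can be normalized (an assumption).
    weights = {c: max(cosine(EMB[current], EMB[c]), 0.0) for c in candidates}
    total = sum(weights.values())
    if total == 0:
        # Nothing is similar: fall back to a uniform distribution.
        return {c: 1.0 / len(candidates) for c in candidates}
    return {c: w / total for c, w in weights.items()}

probs = next_token_probs("cat", ["dog", "car"])
```

Note that the distribution only depends on `current`, which is the limitation I get to next.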

I suppose it may help improve overall performance, you're still just blindly predicting the next word based on the previous one like a first order Markov chain would.

For example, it won't ever fit "2 plus 2 equals 4," because right when we get to "equals," we discard all the previous words.
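
A toy illustration of that, assuming a plain frequency-based first-order chain and a two-sentence corpus I made up. Once the training data contains more than one sum, the prediction after "equals" can't depend on the operands.

```python
from collections import Counter, defaultdict

# Made-up training corpus with two different sums.
corpus = ["2 plus 2 equals 4", "3 plus 3 equals 6"]

# First-order chain: count which token follows each single token.
follows = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        follows[prev][nxt] += 1

def next_probs(prev):
    """Next-token distribution given only the previous token."""
    counts = follows[prev]
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

# The operands are already forgotten at this point.
after_equals = next_probs("equals")
```

Here `after_equals` splits evenly between "4" and "6", whichever numbers came before.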

Perhaps if we could get the embedding model to consider the full sentence and then produce a set of probability-scored next-token predictions, it may work, but now we've just reinvented a transformer.

> "I suppose it may help improve overall performance, you're still just blindly predicting the next word based on the previous one like a first order Markov chain would."

Instead of only taking in the last "token" as context to the function that generates the next token, take the last 15 tokens (i.e., the last 2-3 sentences) and predict based on that. And that's your "attention" mechanism.
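
A minimal sketch of the fixed-window idea, assuming simple count tables keyed on the last `k` tokens (strictly speaking this is an order-k Markov chain; `train` and `predict` are hypothetical names, and I use `k=4` rather than 15 so the toy corpus fits).

```python
from collections import Counter, defaultdict

def train(corpus, k):
    """Count which token follows each window of k consecutive tokens."""
    table = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for i in range(k, len(tokens)):
            table[tuple(tokens[i - k:i])][tokens[i]] += 1
    return table

def predict(table, context, k):
    """Most frequent next token for the last k tokens, or None if never seen."""
    counts = table.get(tuple(context[-k:]))
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Made-up toy corpus.
corpus = ["2 plus 2 equals 4", "3 plus 3 equals 6"]
table = train(corpus, k=4)

seen = predict(table, "2 plus 2 equals".split(), k=4)    # window seen in training
unseen = predict(table, "2 plus 3 equals".split(), k=4)  # window never seen
```

With the whole window as the key, `seen` comes back as "4", but `unseen` is `None` because that exact window never appeared in training.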

Yeah, that's just attention with extra steps.