by pona-a
907 days ago
That sounds interesting, but it still fails to capture long-range dependencies. If I understand correctly, what you're proposing is to replace co-occurrence frequency with word2vec cosine similarity. I suppose it may help improve overall performance, but you're still just blindly predicting the next word based on the previous one, like a first-order Markov chain would. For example, it won't ever fit "2 plus 2 equals 4," because right when we get to "equals," we discard all the previous words. Perhaps if we could get the embedding model to consider the full sentence and then produce a set of probability-scored next-token predictions, it may work, but now we've just reinvented a transformer.
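To make the "2 plus 2 equals 4" point concrete, here is a minimal sketch of a count-based first-order Markov chain. The two-sentence corpus and the helper names (`train_bigram`, `next_word_distribution`) are made up for illustration, and raw co-occurrence counts stand in for whatever scoring (e.g. cosine similarity) is actually used. The point is only that the prediction after "equals" cannot depend on the operands.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count, for each word, which words follow it (first-order chain)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_distribution(counts, context):
    """Predict from the LAST word only; everything before it is ignored."""
    last = context.split()[-1]
    total = sum(counts[last].values())
    return {word: c / total for word, c in counts[last].items()}

# Hypothetical toy corpus, chosen so that "equals" has two possible successors.
corpus = ["2 plus 2 equals 4", "3 plus 3 equals 6"]
bigram_model = train_bigram(corpus)

dist_a = next_word_distribution(bigram_model, "2 plus 2 equals")
dist_b = next_word_distribution(bigram_model, "3 plus 3 equals")
# Both contexts end in "equals", so both get the same distribution.
```

Swapping the counts for embedding similarity changes the numbers in the table, but not the fact that only the last word is consulted.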
Instead of only taking in the last "token" as context to the function that generates the next token, take the last 15 tokens (i.e. the last 2-3 sentences) and predict based on that. And that's your "attention" mechanism.
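A rough sketch of the fixed-window idea, under the assumption that "predict based on that" means looking up the whole window in a count table. The names (`train_window`, `predict`) and the toy corpus are hypothetical, and this is a plain higher-order lookup, not a model with learned attention weights.

```python
from collections import Counter, defaultdict

def train_window(sentences, k):
    """Count next tokens keyed by the (up to) k tokens that precede them."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(1, len(tokens)):
            context = tuple(tokens[max(0, i - k):i])
            counts[context][tokens[i]] += 1
    return counts

def predict(counts, text, k):
    """Return the most frequent continuation of the last k tokens, if seen."""
    context = tuple(text.split()[-k:])
    options = counts.get(context)
    return options.most_common(1)[0][0] if options else None

K = 15  # window size taken from the comment above
window_model = train_window(["2 plus 2 equals 4", "3 plus 3 equals 6"], K)

seen_a = predict(window_model, "2 plus 2 equals", K)
seen_b = predict(window_model, "3 plus 3 equals", K)
unseen = predict(window_model, "5 plus 5 equals", K)
```

With the operands inside the window, the two training sentences are told apart. A context that never appeared verbatim in the corpus has no entry at all, which is the sparsity cost of an exact-match table over long windows.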