Hacker News new | ask | show | jobs
by anon291 968 days ago
See https://github.com/mit-han-lab/streaming-llm and others. There's good reason to believe that attention networks learn how to update their own weights (Forget the paper) based on their input. The attention mechanism can act like a delta to update weights as the data propagates through the layers. The issue is getting the token embeddings to be more than just the 50k or so that we use for the english language so you can explore the full space, which is what the attention sink mechanism is trying to do.