|
|
|
|
|
by littlestymaar
602 days ago
|
|
> We can take to gain orders of magnitude more performance just like the leap that the Transformers paper had. Afaik the most important benefit of transformers aren't their “performance” (in the sense of ability to perform their tasks) but their scalability which come from their ability to be trained and evaluated efficiently on big GPU clusters, which isn't something you can do with recurrent neural networks. And then, if I understood correctly, the benefit of state-space models being that you can train them in parallel and run them in a recurrent fashion, making inference cheaper than transformers especially when context size grow. |
|
It was also my understanding that without those attention heads even the scaling up to current parameter sizes we have to day would not have ended up with the level of emergent intelligence that shocked the world with GPT 3.5. We needed both very large models and words put into semantic context in semantic space.