|
|
|
|
|
by sigmoid10
1152 days ago
|
|
I'm well aware of the research that led to it. I was already working in the field back then and I remember that the community was far from realizing how monumental this paper would end up being. Otherwise the authors probably would have considered a more informative or at least less ambiguous title. It also didn't help that the architecture they described (encoder-decoder) was actually even more complicated than what we have now in GPT and the likes. And the really important thing was not that it could train more epochs than recurrent architectures (although that certainly helped the huge models that came later), but it could drastically extend context length for sequence tasks. They went from a theoretically infinite (but in practice very limited) context length to a fundamentally limited but practically obtainable one. |
|