by cs702
2000 days ago
No questions. After just a quick skim, this paper looks like great work. The findings are remarkable, and they're presented in clear, to-the-point language. I confess to being a bit shocked that, given the same number of parameters, training is 1.65x faster (whoa), generation is 9x faster (wait, what!?), and perplexity is better (which is a flawed measure, but still), all by using a new form of "curriculum learning" and adding position embeddings to the queries and keys but not the values. And it's so nice to see new ideas and improvements that don't rely on yet more computation or yet more parameters (I'm looking at you, GPT-3). Congratulations!
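
For anyone wondering what "position embeddings on the queries and keys but not the values" might look like in practice, here is a minimal single-head sketch in NumPy. It is my own illustration, not the paper's code: the function name `position_infused_attention`, the weight names, the causal mask, and the toy sizes are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def position_infused_attention(x, pos, w_q, w_k, w_v):
    """Single-head causal attention (hypothetical helper).

    x:   (seq, d) token representations, with no position information
    pos: (seq, d) position embeddings
    """
    q = (x + pos) @ w_q   # queries see positions
    k = (x + pos) @ w_k   # keys see positions
    v = x @ w_v           # values do not
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Causal mask (an assumption here): token i attends only to tokens <= i.
    seq = x.shape[0]
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

# Toy usage with random inputs.
rng = np.random.default_rng(0)
seq, d = 4, 8
x = rng.normal(size=(seq, d))
pos = rng.normal(size=(seq, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out = position_infused_attention(x, pos, w_q, w_k, w_v)
print(out.shape)  # (4, 8)
```

In this sketch the positions only influence where each token attends; what gets mixed together (the values) is position-free.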