by porridgeraisin
260 days ago
This is just from my skim of the paper, so take it with a pinch of salt. It's tiny in terms of number of weights. This is because it reuses and refines the same weights across recursion steps, instead of having separate weights for each layer, which is what the stacked transformer blocks in usual LLMs do. However, the FLOPs are the same as for an unshared stack of the same effective depth. In usual LLMs you have (number of transformer blocks) * (per-block cost); here you have (number of recursion steps) * (number of blocks, fewer than usual, 2 here) * (per-block cost). Basically, this needs compute like a 16-block LLM per training step, because here there are 8 recursions and 2 blocks (8 * 2 = 16). How many training steps you need depends mostly on the dataset used.
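To make the arithmetic concrete, here is a rough sketch of that counting. The function names and the unit `per_block_cost` are made up for illustration, and it assumes every block costs the same, which is a simplification and not something taken from the paper.

```python
def stacked_cost(num_blocks, per_block_cost=1.0):
    """Compute for a usual stacked model: each block runs once."""
    return num_blocks * per_block_cost


def recursive_cost(recursion_steps, num_blocks, per_block_cost=1.0):
    """Compute for a weight-shared model: the same blocks run on every recursion step."""
    return recursion_steps * num_blocks * per_block_cost


# Numbers from the comment above: 8 recursions over 2 shared blocks.
recursions, shared_blocks = 8, 2

compute = recursive_cost(recursions, shared_blocks)
equivalent_depth = recursions * shared_blocks  # blocks in a stacked model with equal compute

# Weights scale with the number of distinct blocks, not with how often they run.
weight_ratio = shared_blocks / equivalent_depth

print(f"compute matches a {equivalent_depth}-block stacked model: "
      f"{compute == stacked_cost(equivalent_depth)}")
print(f"distinct blocks holding weights: {shared_blocks} vs {equivalent_depth} "
      f"(ratio {weight_ratio})")
```

So under these assumptions the saving is in parameters (2 blocks' worth instead of 16), not in compute per step.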
I'm particularly keen to see if you could do speech-to-text with this architecture, and replace Whisper for smaller devices.