by exe34
260 days ago
Any idea why the tiny network takes days to run on massive GPUs? Is it the large dataset, or the recursive nature of the algorithm? I.e. would a simple question take hours to solve, or require a huge amount of memory? I don't have much experience with the nitty-gritty details, and I'm wondering if I'll be able to run some interesting training on a 3090 in a few days.

It's tiny in terms of the number of weights. That's because it reuses and refines the same weights across recursion steps, instead of having separate weights for each layer, which is what the stacked transformers in usual LLMs do.
However, the FLOPs are exactly the same as for a stacked model of the same effective depth.
In a usual LLM the cost is (number of transformer blocks) * (per-block cost). Here it is (number of recursion steps) * (number of blocks, fewer than usual, 2 here) * (per-block cost).
Basically, each training step needs about as much compute as a 16-block LLM, because there are 8 recursions and 2 blocks. How many steps you need depends mostly on the dataset used.
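
To make the arithmetic concrete, here's a minimal sketch in Python. The function names and the unit per-block cost are made up for illustration, and it only counts block applications (it ignores embeddings, the backward pass, and anything else specific to the actual model).

```python
# Hypothetical helpers: count block applications per forward pass.
# per_block_cost is an arbitrary unit, not a measured FLOP count.

def stacked_cost(num_blocks: int, per_block_cost: float = 1.0) -> float:
    """Usual LLM: every block has its own weights and runs once."""
    return num_blocks * per_block_cost

def recursive_cost(recursion_steps: int, num_blocks: int,
                   per_block_cost: float = 1.0) -> float:
    """Recursive model: the same few blocks run once per recursion step."""
    return recursion_steps * num_blocks * per_block_cost

# Numbers from the comment above: 8 recursions over 2 blocks.
recursive = recursive_cost(recursion_steps=8, num_blocks=2)
stacked = stacked_cost(num_blocks=16)

print(recursive == stacked)  # same compute per step

# Weights stored, in units of "one block's worth of weights":
recursive_weight_blocks = 2   # shared across all recursion steps
stacked_weight_blocks = 16    # one set per layer
print(stacked_weight_blocks // recursive_weight_blocks)  # ratio of weights stored
```

So under these assumptions the recursive model stores 8x fewer block weights but does the same number of block applications per step, which is why "tiny" doesn't mean "cheap to train".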