by exe34 260 days ago
Any idea why the tiny network takes days to run on massive GPUs? Is it the large dataset, or the recursive nature of the algorithm? I.e., would a simple question take hours to solve or require a huge amount of memory?

I don't have a huge amount of experience in the nitty gritty details and I'm wondering if I'll be able to run some interesting training on a 3090 in a few days.


This is just from my skim of the paper, so take it with a pinch of salt.

It's tiny in terms of number of weights. This is because it reuses the same weights across recursion steps, instead of having a separate set of weights for each layer, which is what the stacked transformer blocks in usual LLMs do.

However, the FLOPs are the same as for an ordinary stack of the same depth, since reusing weights doesn't skip any computation.

In usual LLMs you have (number of transformer blocks) * (per-block cost). Here you have (number of recursion steps) * (number of blocks, smaller than usual, 2 here) * (per-block cost).

Basically, this needs compute like a 16-block LLM per training step, because here recursions = 8 and there are 2 blocks. How many training steps you need depends mostly on the dataset used.
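
To make that arithmetic concrete, here's a rough sketch. The function names are mine, not from the paper, and it assumes every block costs the same and ignores embeddings and anything else the training loop does per step. The 8 recursions and 2 blocks are the numbers quoted above.

```python
def effective_blocks(recursion_steps: int, blocks_per_step: int) -> int:
    """Number of block applications in one forward pass."""
    return recursion_steps * blocks_per_step


def weight_saving_factor(recursion_steps: int, blocks_per_step: int) -> int:
    """How many times more block weights an unshared stack of the same depth stores.

    Assumption: the unshared stack has one distinct block per application,
    while the recursive model stores only `blocks_per_step` blocks.
    """
    return effective_blocks(recursion_steps, blocks_per_step) // blocks_per_step


RECURSIONS = 8  # from the comment above
BLOCKS = 2      # from the comment above

depth = effective_blocks(RECURSIONS, BLOCKS)       # 16 block applications
saving = weight_saving_factor(RECURSIONS, BLOCKS)  # 8x fewer block weights

print(f"compute per step ~ a {depth}-block stack")
print(f"block weights stored ~ 1/{saving} of that stack")
```

So under these assumptions the compute matches a 16-block stack while the stored block weights are about an eighth of it.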

If this is a way to get equivalent results to a much larger network for the same FLOPs but with a fraction of the VRAM, it's transformative.

I'm particularly keen to see if you could do speech-to-text with this architecture, and replace Whisper for smaller devices.

Nvidia's Parakeet dropped recently with better performance and 0.6B params, so the rate of progress here looks good. Probably next year (or maybe the year after) they'll be running on smaller devices no problem.

I mean, I wouldn't say it's transformative or bet on it equalling usual LLM performance in general. It's kind of similar to the weight reuse you see in RNNs, where the same weights are applied at every step while a hidden state `h` is carried along. In usual LLMs each block has its own weights.

These guys are choosing a middle ground: stacking a few transformer blocks, and then reusing the same 2 blocks 8 times over.
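
As a toy illustration of that reuse pattern (not the paper's code): the two "blocks" below are plain functions standing in for transformer blocks, and `run_recursive`, `block_a` and `block_b` are names I made up. The only point is that the same two objects get applied on every recursion step.

```python
from typing import Callable, List, Tuple

Block = Callable[[float], float]


def run_recursive(x: float, blocks: List[Block], recursion_steps: int) -> Tuple[float, int]:
    """Apply the same small stack of blocks `recursion_steps` times.

    Returns the final state and the number of block applications.
    """
    calls = 0
    for _ in range(recursion_steps):
        for block in blocks:  # same block objects (same "weights") every step
            x = block(x)
            calls += 1
    return x, calls


# Hypothetical stand-ins for the two shared transformer blocks.
def block_a(x: float) -> float:
    return 0.5 * x + 1.0


def block_b(x: float) -> float:
    return 0.9 * x


shared_blocks = [block_a, block_b]
out, calls = run_recursive(1.0, shared_blocks, recursion_steps=8)

print(f"{len(shared_blocks)} blocks stored, {calls} block applications")
```

A usual stacked model would instead hold 16 different blocks and walk through them once.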

It'll be interesting to see what use cases are served well by this approach. Our understanding of how these architectures respond to such changes is still largely empirical, so it's hard to say ahead of time. My intuition is that for repetitive input signals it could be good; audio processing comes to mind. But complex attention and stuff like ElevenLabs-style translation is probably too much to hope for. Whisper-type transcription, though, might work.

Thank you! I'll need to have a read soon.