
by aljungberg 1252 days ago

It does say on there they are training it on the Pile training data. And they have this bit comparing inference with GPT2-XL:

RWKV-3 1.5B on A40 (tf32) = always 0.015 sec/token, tested using simple pytorch code (no CUDA), GPU utilization 45%, VRAM 7823M

GPT2-XL 1.3B on A40 (tf32) = 0.032 sec/token (for ctxlen 1000), tested using HF, GPU utilization 45% too (interesting), VRAM 9655M

So it looks about twice as fast for inference while using only about 80% as much VRAM. Obviously at such a small size, just 1.5B, you can run it even on consumer GPUs, but you could do that with GPT2 as well. If it stays at 80% of the VRAM usage when scaled up, we're still talking about 282GB once it's the size of BLOOM w/ 176B parameters. So yeah, still 8x A100 40GB cards I guess. Not going to be the Stable Diffusion of LLMs.
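For anyone who wants to check the back-of-envelope numbers, here is a rough sketch of the arithmetic. The timing and VRAM figures are the ones quoted above; the 2 bytes per parameter (fp16 weights) for BLOOM is an assumption that is not stated in the comment but is consistent with the 282GB figure, and the variable names are only illustrative.

```python
import math

# Figures quoted above (A40, tf32).
rwkv_sec_per_token = 0.015
gpt2_sec_per_token = 0.032   # at ctxlen 1000
rwkv_vram_mb = 7823
gpt2_vram_mb = 9655

# Relative speed and memory use of RWKV-3 1.5B vs GPT2-XL.
speedup = gpt2_sec_per_token / rwkv_sec_per_token   # a bit over 2x
vram_ratio = rwkv_vram_mb / gpt2_vram_mb            # roughly 0.8

# ASSUMPTION: weights stored in fp16, i.e. 2 bytes per parameter.
bloom_params = 176e9
bytes_per_param = 2
bloom_weights_gb = bloom_params * bytes_per_param / 1e9

# Extrapolate the "about 80%" ratio to a BLOOM-sized model.
scaled_gb = 0.80 * bloom_weights_gb

# Number of 40GB cards needed to hold that much.
cards_needed = math.ceil(scaled_gb / 40)

print(f"speedup: {speedup:.2f}x, VRAM ratio: {vram_ratio:.2f}")
print(f"scaled estimate: {scaled_gb:.1f} GB -> {cards_needed} x 40GB cards")
```

This only scales weight memory linearly; it ignores activations and any per-token state, so treat it as a rough lower bound rather than a measurement.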

1 comment

taktoa 1252 days ago

I'm pretty sure those numbers are for training, not inference. I've run it on _CPU_ and gotten ~1 token per second.
