by leodriesch 1244 days ago
The readme does not seem to be geared towards people not familiar with the topic.

My questions:

- Does this run at consumer-GPU scale, at 8x A100 scale, or at "you can't run it yourself, ever" scale?
- How does it compare to other language models in quality/abilities?
- What is the training data?

2 comments

The readme does say they are training it on the Pile. And they have this bit comparing inference with GPT2-XL:

RWKV-3 1.5B on A40 (tf32) = always 0.015 sec/token, tested using simple pytorch code (no CUDA), GPU utilization 45%, VRAM 7823M

GPT2-XL 1.3B on A40 (tf32) = 0.032 sec/token (for ctxlen 1000), tested using HF, GPU utilization 45% too (interesting), VRAM 9655M

So it looks about twice as fast for inference while using only about 80% as much VRAM. Obviously at such a small size, just 1.5B, you can run it even on consumer GPUs, but you could do that with GPT2 as well. If it stays at 80% of the VRAM usage when scaled up, we're still talking about 282GB once it's the size of BLOOM with 176B parameters (80% of the roughly 352GB that 176B parameters need at 16 bits each). So yeah, still 8x A100 40GB cards I guess. Not going to be the Stable Diffusion of LLMs.
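
For anyone who wants to check that back-of-the-envelope math, here is a rough sketch. The timing and VRAM figures are the ones quoted from the readme above; the 16-bit weight size and the idea that the 80% ratio carries over to larger models are assumptions, not anything the readme claims.

```python
# Figures quoted above from the readme (A40, tf32).
rwkv_sec_per_token = 0.015
gpt2_sec_per_token = 0.032
rwkv_vram_mb = 7823
gpt2_vram_mb = 9655

# How much faster and how much leaner RWKV-3 looks in that comparison.
speedup = gpt2_sec_per_token / rwkv_sec_per_token
vram_ratio = rwkv_vram_mb / gpt2_vram_mb

# Assumption: 16-bit weights (2 bytes per parameter), weights only,
# and the ~80% VRAM ratio holding at BLOOM scale.
bloom_params = 176e9
bytes_per_param = 2
baseline_gb = bloom_params * bytes_per_param / 1e9
scaled_gb = 0.8 * baseline_gb

# Total memory across 8x A100 40GB cards.
a100_total_gb = 8 * 40

print(f"speedup: {speedup:.2f}x")
print(f"VRAM ratio: {vram_ratio:.2f}")
print(f"176B at 16 bits: {baseline_gb:.0f} GB, at 80%: {scaled_gb:.0f} GB")
print(f"fits in 8x A100 40GB ({a100_total_gb} GB): {scaled_gb <= a100_total_gb}")
```

This only counts weights, so activations and any runtime overhead would come on top.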

I'm pretty sure those numbers are for training, not inference. I've run it on _CPU_ and gotten ~1 token per second.
The large model has 14B parameters, so at 16 bits per weight (about 28GB) it won't quite fit on one 3090 or 4090, which have 24GB each.
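
A quick sketch of that estimate, counting weights only. `weights_gb` and `CARD_VRAM_GB` are made-up names for illustration; the 24GB figure is the VRAM of a 3090 or 4090.

```python
def weights_gb(n_params: float, bits_per_weight: int = 16) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


CARD_VRAM_GB = 24  # RTX 3090 / RTX 4090

needed = weights_gb(14e9)  # 14B parameters at 16 bits each
print(f"weights: {needed:.0f} GB, card: {CARD_VRAM_GB} GB, fits: {needed <= CARD_VRAM_GB}")
```

Actual usage would be somewhat higher than this, since it ignores activations and framework overhead.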