|
|
|
|
|
by leodriesch
1244 days ago
|
|
The readme does not seem to be geared towards people not familiar with the topic. My questions: - Is this on the run on consumer GPU scale, or run on 8 A100 scale or you can’t run it yourself ever scale?
- How does it compare to other language models in quality/abilities?
- What is the training data? |
|
RWKV-3 1.5B on A40 (tf32) = always 0.015 sec/token, tested using simple pytorch code (no CUDA), GPU utilization 45%, VRAM 7823M
GPT2-XL 1.3B on A40 (tf32) = 0.032 sec/token (for ctxlen 1000), tested using HF, GPU utilization 45% too (interesting), VRAM 9655M
So it looks about twice as fast for inference while using only about 80% as much VRAM. Obviously at such a small size, just 1.5B, you can run it even on consumer GPUs but you could do that with GPT2 as well. If it remains 80% of VRAM usage when scaled up, we’re still talking 282GB once it’s the size of BLOOM w/ 176B parameters. So yeah still 8x A100 40GB cards I guess. Not going to be the Stable Diffusion of LLMs.