by GaggiX 50 days ago
At 4-bit quantization it should already fit quite nicely.

Unfortunately not with a reasonable context length.
I get 139k context with the UD-Q4_K_XL quant on a 4090, with the K and V caches at q8_0 (ctk/ctv). I could probably squeeze out a little more, but that's enough for me for the moment.
Hey, buddy! Can I bum a command line arg list off ya?
The model uses Gated DeltaNet and Gated Attention so the memory usage of the KV cache is very low, even at BF16 precision.
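To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes only every Nth layer is full (gated) attention and therefore keeps a per-token KV cache, while the linear-attention layers keep a fixed-size state that does not grow with context and is ignored here. The function name and every shape below (layer count, attention ratio, KV heads, head dimension) are hypothetical placeholders, not values from the model card.

```python
def kv_cache_bytes(n_layers, attn_every, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    """Approximate KV-cache size when only every `attn_every`-th layer is full attention.

    Assumption: the remaining layers are linear-attention layers whose state
    size is independent of context length, so they are left out of the total.
    """
    n_attn_layers = n_layers // attn_every
    # K and V each hold n_kv_heads * head_dim values per token per attention layer.
    return 2 * n_attn_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem


# Hypothetical shapes for illustration only.
CTX = 131072   # 128k tokens
BF16 = 2       # bytes per element

hybrid = kv_cache_bytes(n_layers=48, attn_every=4, n_kv_heads=2,
                        head_dim=256, ctx_len=CTX, bytes_per_elem=BF16)
dense = kv_cache_bytes(n_layers=48, attn_every=1, n_kv_heads=2,
                       head_dim=256, ctx_len=CTX, bytes_per_elem=BF16)

print(f"hybrid: {hybrid / 2**30:.1f} GiB, all-attention: {dense / 2**30:.1f} GiB")
```

With these made-up shapes the hybrid layout needs a quarter of the cache an all-attention stack would, which is the effect being described; plug in the real config values to get an actual figure.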
It really depends on what you think a reasonable context length is, but I can get 50k-60k on a 4090.