by austinvhuang 841 days ago
A nice side effect of implementing CPU SIMD is that you just need enough regular RAM, which tends to be far less scarce than VRAM. Nonetheless, I get your point that more aggressive quantization is valuable, and I will share it with the modeling team.

True, it's the only way I can, for example, run Mixtral on an 8GB GPU. But main memory will always have more latency, so some tradeoff tends to be worth it. Parts like the prompt batch buffer and most of the context generally have to be in VRAM if you want to use cuBLAS. With OpenBLAS that's maybe less of a problem, but it is slower.
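
For a rough sense of the tradeoff, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the ~47B parameter count and 32 layers for Mixtral, 4 bits per weight, equally sized layers, and a 2 GiB reservation for the batch buffer and context. The helper names are made up and not taken from any library.

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB (ignores per-block scale overhead)."""
    return n_params * bits_per_weight / 8 / 2**30


def layers_on_gpu(n_layers: int, weights_gib: float,
                  vram_gib: float, reserved_gib: float) -> int:
    """How many layers fit in VRAM, assuming all layers are the same size
    and reserved_gib is held back for the batch buffer and context."""
    per_layer = weights_gib / n_layers
    usable = max(vram_gib - reserved_gib, 0.0)
    return min(n_layers, int(usable // per_layer))


# Assumed figures, not measurements.
N_PARAMS = 47e9      # rough total parameter count assumed for Mixtral
N_LAYERS = 32        # assumed layer count
total = weight_gib(N_PARAMS, bits_per_weight=4)
on_gpu = layers_on_gpu(N_LAYERS, total, vram_gib=8, reserved_gib=2)
print(f"weights: {total:.1f} GiB, layers on GPU: {on_gpu}/{N_LAYERS}")
```

With these assumed numbers the weights come to roughly 22 GiB, so only about a quarter of the layers fit on the GPU and the rest have to sit in main memory.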