Hacker News
by andy_ppp 236 days ago
How does this work with anything but trivially small context sizes!?
1 comment

Tensor parallelism: the attention heads are split across GPUs, so each GPU only needs to store its fraction of the KV cache.
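
To put rough numbers on that, here is a back-of-the-envelope sketch. The model shape (80 layers, 8 KV heads, head dim 128, fp16) and the 128K context are illustrative assumptions, not the model from the thread, and `kv_cache_bytes_per_gpu` is a hypothetical helper, not a library function. It also assumes the KV heads divide evenly across the tensor-parallel group and ignores weights, activations, and allocator overhead.

```python
def kv_cache_bytes_per_gpu(num_layers, num_kv_heads, head_dim, seq_len,
                           batch_size=1, bytes_per_elem=2, tp_degree=1):
    """Approximate KV cache size held by one GPU under tensor parallelism.

    Assumes KV heads are sharded evenly across the tensor-parallel group.
    """
    if num_kv_heads % tp_degree != 0:
        raise ValueError("this sketch assumes num_kv_heads is divisible by tp_degree")
    heads_per_gpu = num_kv_heads // tp_degree
    # Factor of 2: one key tensor and one value tensor per layer.
    return (2 * num_layers * heads_per_gpu * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Illustrative shape only (assumed): 80 layers, 8 KV heads, head_dim 128, fp16.
for tp in (1, 2, 4, 8):
    size = kv_cache_bytes_per_gpu(80, 8, 128, seq_len=131072, tp_degree=tp)
    print(f"tp={tp}: {size / 2**30:.1f} GiB per GPU")
# tp=1: 40.0 GiB per GPU
# tp=2: 20.0 GiB per GPU
# tp=4: 10.0 GiB per GPU
# tp=8: 5.0 GiB per GPU
```

Under these assumptions a single 128K-token sequence goes from 40 GiB of cache on one GPU to 5 GiB per GPU with 8-way tensor parallelism. If the model has fewer KV heads than GPUs, the heads would have to be replicated rather than split, and the savings stop scaling.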