by rfoo 483 days ago
... and batching does not help: you batch more requests and get more KV cache to load, so it's still memory-access bound.
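
A rough back-of-envelope sketch of that point, assuming fp16 and plain multi-head attention during decode. The function names and shapes are made up for illustration, and activation/softmax traffic is ignored. Per decode step each request reads its own KV cache, so FLOPs and bytes both grow with the batch and the ratio stays flat, while a weight matmul reuses the same weights across the batch:

```python
def attention_decode_intensity(batch, seq_len, n_heads, head_dim, bytes_per_elem=2):
    """FLOPs per byte of KV cache read for one decode step (one new token per request)."""
    # K and V for every cached token, for every request in the batch.
    kv_bytes = batch * seq_len * 2 * n_heads * head_dim * bytes_per_elem
    # q.K^T and p.V: 2 * seq_len * head_dim FLOPs each, per head, per request.
    flops = batch * n_heads * 4 * seq_len * head_dim
    return flops / kv_bytes


def linear_decode_intensity(batch, n_params, bytes_per_elem=2):
    """FLOPs per byte of weights read for a weight matmul (weights shared across the batch)."""
    flops = 2 * batch * n_params
    weight_bytes = n_params * bytes_per_elem
    return flops / weight_bytes


if __name__ == "__main__":
    for b in (1, 8, 64):
        attn = attention_decode_intensity(b, seq_len=4096, n_heads=32, head_dim=128)
        lin = linear_decode_intensity(b, n_params=4096 * 4096)
        print(f"batch={b:3d}  attention={attn:.1f} FLOP/B  linear={lin:.1f} FLOP/B")
```

Under these assumptions the attention number does not move with batch size, while the linear-layer number scales with it.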

MLA made it possible to cache a smaller form of k/v, mitigating the problem (but not completely solving it: on shorter contexts and smaller batches it's still memory-access bound).
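
To put a rough number on "smaller form", here is a sketch comparing cached elements per token per layer. The dimensions are illustrative assumptions (128 heads of dim 128 for standard MHA, a 512-dim compressed latent plus a 64-dim positional key for the MLA-style cache), not figures taken from this thread, and the helper names are hypothetical:

```python
def mha_cache_elems_per_token(n_heads, head_dim):
    """Standard MHA caches a full K and a full V vector per head."""
    return 2 * n_heads * head_dim


def mla_cache_elems_per_token(latent_dim, rope_dim):
    """MLA-style cache: one compressed latent shared by all heads, plus a small positional key."""
    return latent_dim + rope_dim


def cache_bytes(batch, seq_len, n_layers, elems_per_token, bytes_per_elem=2):
    """Total cache bytes that have to be read for one decode step."""
    return batch * seq_len * n_layers * elems_per_token * bytes_per_elem


if __name__ == "__main__":
    mha = mha_cache_elems_per_token(n_heads=128, head_dim=128)   # assumed shape
    mla = mla_cache_elems_per_token(latent_dim=512, rope_dim=64)  # assumed shape
    print(f"MHA: {mha} elems/token/layer, MLA-style: {mla}, ratio ~{mha / mla:.1f}x")
    for name, elems in (("MHA", mha), ("MLA-style", mla)):
        gib = cache_bytes(batch=32, seq_len=32768, n_layers=60, elems_per_token=elems) / 2**30
        print(f"{name}: {gib:.1f} GiB of cache read per decode step")
```

The cache still grows linearly with batch and context either way; the compressed form only shrinks the constant.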