by ani17
70 days ago
Author here. A bit more context: by day I'm a systems engineer building AI networking infrastructure, so I kept ending up in conversations where I couldn't quite wrap my head around the latest inference magic trick. For example, when someone mentioned vLLM's paged attention, I knew virtual memory paging but had no idea someone had applied the same idea to KV cache allocation on GPUs. GitHub link to the project: https://github.com/Anirudh171202/WhiteLotus