Hacker News new | ask | show | jobs
by rushingcreek 652 days ago
We run the nodes "hot" and close to overload for peak throughput. That's why NVIDIA's XQA innovation was so interesting, because it allows for much higher throughput for a given latency budget: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source....

Serverless would make more sense if we had a significant underutilization problem.