


	by swyx 649 days ago
	> The model, based on Meta Llama 3.1 8B, runs on a Phind-customized NVIDIA TensorRT-LLM inference server that offers extremely fast speeds on H100 GPUs. We start by running the model in FP8, and also enable flash decoding and fused CUDA kernels for MLP.

	As far as I know, you are running your own GPUs. What do you do in overload? Have a queue system? What do you do in underload? Just eat the costs? Is there a "serverless" system here that makes sense, or is anyone working on one?

1 comment

rushingcreek 649 days ago

We run the nodes "hot" and close to overload for peak throughput. That's why NVIDIA's XQA innovation was so interesting, because it allows for much higher throughput for a given latency budget: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source....

Serverless would make more sense if we had a significant underutilization problem.
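
To make the "higher throughput for a given latency budget" point concrete, here is a toy sketch. It assumes a purely hypothetical linear cost model (a fixed per-step cost plus a per-sequence cost) and made-up numbers; it does not describe how TensorRT-LLM or XQA work internally. The function names and parameters are invented for illustration.

```python
def max_batch_within_budget(base_ms, per_seq_ms, budget_ms):
    """Largest batch size whose decode-step latency fits the budget.

    Hypothetical model: step latency = base_ms + per_seq_ms * batch.
    """
    if base_ms > budget_ms:
        return 0
    return int((budget_ms - base_ms) // per_seq_ms)


def throughput_tokens_per_s(batch, base_ms, per_seq_ms):
    """Tokens generated per second across the whole batch (one token per sequence per step)."""
    if batch == 0:
        return 0.0
    step_ms = base_ms + per_seq_ms * batch
    return batch * 1000.0 / step_ms


if __name__ == "__main__":
    budget_ms = 30.0  # assumed per-token latency budget
    # Two made-up kernels that differ only in per-sequence cost.
    for name, per_seq_ms in [("baseline", 0.5), ("faster kernel", 0.25)]:
        batch = max_batch_within_budget(10.0, per_seq_ms, budget_ms)
        tput = throughput_tokens_per_s(batch, 10.0, per_seq_ms)
        print(f"{name}: batch={batch}, throughput={tput:.0f} tokens/s")
```

Under this model, anything that lowers the per-sequence cost lets a node carry a larger batch at the same latency, which is why running close to the budget ("hot") maximizes throughput.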
