by SethTro 846 days ago
> Phind-70B is significantly faster than GPT-4 Turbo ... We're able to achieve this by running NVIDIA's TensorRT-LLM library on H100 GPUs
2 comments

As someone who has used NVIDIA Triton Inference Server for years, it's really interesting to see people publicly disclosing use of TensorRT-LLM (almost certainly in conjunction with Triton).

Up until TensorRT-LLM, Triton had been kind of an in-group secret amongst high-scale inference providers. Now you can readily find announcements, press releases, etc. of Triton (TensorRT-LLM) usage from the likes of Mistral, Phind, Cloudflare, Amazon, etc.

Being accessible is huge.

I still see posts of people running ollama on H100s or whatever, and that's just because it's so easy to set up.

How many H100 GPUs does it take to serve 1 Phind-70B model? Are they serving it with bf16, or int8, or lower quants?
This video [1] shows someone running at 4-bit quant in 48 GB of VRAM. I suspect you need 4x that (about 192 GB) to run at full fp16 precision, or approx. 3 H100s at 80 GB each.

[1] https://www.youtube.com/watch?v=dJ69gY0qRbg

Yeah, 4-bit would take 35 GB at least. 16-bit would be 140 GB. I'm more interested in how Phind is serving it. But I guess that's their trade secret.
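
For anyone who wants to redo the back-of-envelope math in this thread, here is a rough sketch. It assumes exactly 70B parameters (inferred from the model name), counts the weights only (no KV cache, activations, or runtime overhead), and assumes the 80 GB H100 variant. The helper names `weight_memory_gb` and `min_gpus` are made up for illustration, so treat the results as lower bounds, not as how Phind actually serves the model.

```python
import math

def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

def min_gpus(memory_gb: float, gpu_memory_gb: float = 80.0) -> int:
    """Smallest GPU count whose combined memory covers memory_gb.

    gpu_memory_gb defaults to 80, an assumption matching the 80 GB H100.
    """
    return math.ceil(memory_gb / gpu_memory_gb)

N_PARAMS = 70e9  # assumption: "70B" taken at face value

for name, bits in [("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    gb = weight_memory_gb(N_PARAMS, bits)
    print(f"{name}: {gb:.0f} GB of weights, at least {min_gpus(gb)} x 80 GB GPU")
```

The weights-only figures match the 35 GB and 140 GB numbers above. The "4x 48 GB" estimate is larger because the 48 GB in the video presumably includes more than just weights; `min_gpus(4 * 48)` gives 3, which is where the "approx. 3 H100s" guess comes from.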