| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by SethTro 846 days ago
	> Phind-70B is significantly faster than GPT-4 Turbo ... We're able to achieve this by running NVIDIA's TensorRT-LLM library on H100 GPUs

2 comments

kkielhofner 846 days ago

As someone who has utilized Nvidia Triton Inference Server for years it's really interesting to see people publicly disclosing use of TensorRT-LLM (almost certainly in conjunction with Triton).

Up until TensorRT-LLM Triton had been kind of an in-group secret amongst high scale inference providers. Now you can readily find announcements, press releases, etc of Triton (TensorRT-LLM) usage from the likes of Mistral, Phind, Cloudflare, Amazon, etc.

link

brucethemoose2 846 days ago

Being accesible is huge.

I still see post of people running ollama on H100s or whatever, and that's just because its so easy to set up.

link

jxy 846 days ago

How many H100 GPUs does it take to serve 1 Phind-70B model? Are they serving it with bf16, or int8, or lower quants?

link

tarruda 846 days ago

This video [1] shows someone running at 4-bit quant in 48gb VRAM. I suspect you need 4x that to run at full f16 precision, or approx 3 H100.

https://www.youtube.com/watch?v=dJ69gY0qRbg

link

jxy 846 days ago

Yeah, 4bit would take 35 GB at least. 16bit would be 140 GB. I'm more interested in how Phind is serving it. But I guess that's their trade secret.

link