| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by echelon 243 days ago

8xH100 is pretty wild for a single inference node.

Is this what production frontier LLMs are running inference with, or do they consume even more VRAM/compute?

At ~$8/hr, assuming a request takes 5 seconds to fulfill, you can service roughly 700ish requests. About $0.01 per request.

Is my math wrong?

2 comments

This is the spec for a training node. The inference requires 80GB of VRAM, so significantly less compute.

The default model is ~0.5B params right?

As vessenes wrote, that‘s for training. But a H100 can also process many requests in parallel.