Hacker News new | ask | show | jobs
by echelon 243 days ago
8xH100 is pretty wild for a single inference node.

Is this what production frontier LLMs are running inference with, or do they consume even more VRAM/compute?

At ~$8/hr, assuming a request takes 5 seconds to fulfill, you can service roughly 700ish requests. About $0.01 per request.

Is my math wrong?

2 comments

This is the spec for a training node. The inference requires 80GB of VRAM, so significantly less compute.
The default model is ~0.5B params right?
As vessenes wrote, that‘s for training. But a H100 can also process many requests in parallel.