|
|
|
|
|
by echelon
243 days ago
|
|
8xH100 is pretty wild for a single inference node. Is this what production frontier LLMs are running inference with, or do they consume even more VRAM/compute? At ~$8/hr, assuming a request takes 5 seconds to fulfill, you can service roughly 700ish requests. About $0.01 per request. Is my math wrong? |
|