by yencabulator
73 days ago
> I think for the same model wall time is probably a more intuitive metric; at the end of the day what you’re doing is renting GPU time slices

This is a bit too much of a simplification. The LLM provider batches multiple customer requests into one GPU/TPU pass over the weights, with minimal latency increase. The LLM provider may in fact be renting GPUs by the second, but the end user isn't. We, the end users, are essentially timesharing a pool of GPUs without any dedicated "1 vGPU" style resource allocation. In such a setting, charging by "GPU tick" sounds valid, and the various categories of token costs are an approximation of cost+margin.
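To make the batching point concrete, here is a toy sketch. Everything in it is an assumption for illustration: the function names are made up, and the 20 ms weight read and 0.5 ms per-request increment are invented numbers, not measurements from any provider. It only shows the shape of the argument: wall time per pass barely moves as the batch grows, while the GPU time attributable to each request drops sharply.

```python
def pass_latency_ms(batch_size, weight_read_ms=20.0, per_request_ms=0.5):
    """Toy wall time for one pass over the weights.

    Assumes a fixed cost to stream the weights plus a small
    per-request increment (both numbers are hypothetical).
    """
    return weight_read_ms + per_request_ms * batch_size


def gpu_ms_per_request(batch_size, **kwargs):
    """GPU time attributable to each request when the pass is shared."""
    return pass_latency_ms(batch_size, **kwargs) / batch_size


for batch_size in (1, 8, 32):
    wall = pass_latency_ms(batch_size)
    share = gpu_ms_per_request(batch_size)
    print(f"batch={batch_size:>2}  wall={wall:.1f} ms  per-request share={share:.3f} ms")
```

Under these assumed numbers, each user sees roughly the same wall time regardless of batch size, so wall time says little about how much GPU a request actually consumed.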