by vermorel
840 days ago
The "cost" is storing the state of the LLM after processing the input. My back-of-the-envelope guesstimate is about 1GB to capture the 8-bit state of a 70B-parameter model (I might be wrong, though; insights are welcome), which is quite manageable with NVMe storage for fast reload. The operator would charge per "saved" prompt, plus maybe a fixed per-call fee to reload the state.
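
For anyone who wants to check the guesstimate, here is a rough sketch of the arithmetic. It assumes the saved "state" is the attention key/value cache, and it plugs in a Llama-2-70B-like shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128). Those numbers and the function name `kv_cache_bytes` are my assumptions, not something a specific operator has confirmed.

```python
def kv_cache_bytes(n_tokens, n_layers=80, n_kv_heads=8, head_dim=128,
                   bytes_per_value=1):
    """Approximate KV-cache size for a prompt of n_tokens.

    Defaults are an assumed Llama-2-70B-like shape with 8-bit values.
    The factor 2 counts one key and one value vector per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# Bytes of state stored per prompt token under the assumptions above.
per_token = kv_cache_bytes(1)

# How many prompt tokens fit in a 1 GB (10**9 bytes) saved state.
tokens_in_1gb = 10**9 // per_token

print(f"{per_token} bytes per token")        # 163840 bytes per token
print(f"{tokens_in_1gb} tokens fit in 1 GB")  # 6103 tokens fit in 1 GB
```

Under these assumptions the size scales with prompt length rather than with parameter count alone: roughly 160 KB per token, so 1GB corresponds to a prompt of about 6,000 tokens. A model without grouped-query attention (more KV heads) would need proportionally more.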