by Too
339 days ago
When inference already maxes out a GPU’s memory, where are you planning to keep this cache? Unless there is a way to compress the context into a more manageable snapshot, the cloud provider surely won’t keep a GPU idling just to hold a conversation in memory.