by sshumaker
757 days ago
It depends on how large the input prompt (previous context) is. Also, if you can keep the cache on the GPU with an LRU mechanism, it's very efficient for certain workloads. You can also design an API optimized for batch workloads (say, the same core prompt with different data for instruct-style reasoning), which can result in large savings in those scenarios.
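To make the LRU idea concrete, here is a rough sketch of a prefix cache. Everything in it is an assumption for illustration: `PrefixLRUCache`, `get_or_compute`, and `fake_prefill` are hypothetical names, and a plain Python value stands in for the per-prompt state that a real server would keep in GPU memory.

```python
from collections import OrderedDict


class PrefixLRUCache:
    """LRU cache keyed by prompt prefix (hypothetical sketch).

    The cached values stand in for GPU-resident state computed from a prompt.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._entries = OrderedDict()  # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, prefix, compute):
        """Return cached state for `prefix`, computing and storing it on a miss."""
        if prefix in self._entries:
            # Hit: mark the entry as most recently used.
            self._entries.move_to_end(prefix)
            self.hits += 1
            return self._entries[prefix]
        # Miss: pay the full cost once, then cache the result.
        self.misses += 1
        state = compute(prefix)
        self._entries[prefix] = state
        if len(self._entries) > self.capacity:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return state


def fake_prefill(prefix):
    # Hypothetical stand-in for the expensive pass over the prompt.
    return len(prefix.split())


# Batch-style workload: one core prompt reused across different data items.
cache = PrefixLRUCache(capacity=2)
core_prompt = "Classify the sentiment of the following review:"
for review in ["great product", "arrived broken", "works fine"]:
    cache.get_or_compute(core_prompt, fake_prefill)
print(cache.hits, cache.misses)  # 2 1
```

With a shared core prompt, only the first request pays for the prefix; the other two are hits. How much that saves in practice depends on the prompt length and how often prefixes repeat.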
|
Stupid question, but why wouldn't {extremely large slow-write, fast-read memory} + {smaller, very fast-write memory} be a feasible hardware architecture?
This assumes you know many, many cycles ahead what you'll need to have loaded at a specific time.
Or hell, maybe it's time to go back to memory bank switching.