Hacker News
by sshumaker 757 days ago
It depends on how large the input prompt (the previous context) is. Also, if you can keep the cache on the GPU with an LRU eviction mechanism, it's very efficient for certain workloads.

You can also design an API optimized for batch workloads (say the same core prompt with different data for instruct-style reasoning) - that can result in large savings in those scenarios.
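To make the LRU idea concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: `PrefixCache`, `get_or_compute`, and `fake_prefill` are hypothetical names, and the cached "state" is just a number standing in for whatever the server would really keep on the GPU for a prompt prefix. It only shows the bookkeeping: reuse the state when the same core prompt comes back, and evict the least recently used entry when the cache is full.

```python
from collections import OrderedDict


class PrefixCache:
    """Toy LRU cache mapping a prompt prefix to its (stand-in) cached state."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # least recently used entry comes first
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, prefix, compute):
        if prefix in self.entries:
            # Cache hit: mark the entry as most recently used and reuse it.
            self.entries.move_to_end(prefix)
            self.hits += 1
            return self.entries[prefix]
        # Cache miss: pay the full cost of processing the prefix once.
        self.misses += 1
        state = compute(prefix)
        self.entries[prefix] = state
        if len(self.entries) > self.capacity:
            # Evict the least recently used entry.
            self.entries.popitem(last=False)
        return state


def fake_prefill(prefix):
    # Hypothetical stand-in for the expensive pass over the prompt prefix.
    return len(prefix.split())


# Batch-style workload: the same core prompt with different data each time.
cache = PrefixCache(capacity=2)
core_prompt = "Classify the sentiment of the following review:"
for review in ["great", "awful", "fine"]:
    cache.get_or_compute(core_prompt, fake_prefill)
```

In this toy run the core prompt is processed once and reused for the other two requests, which is where the savings for batch workloads would come from.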

1 comment

If you can pipeline upcoming requests and tie state to a specific request, doesn't that allow you to change how you design physical memory? (at least for inference)

Stupid question, but why wouldn't {extremely large slow-write, fast-read memory} + {smaller, very fast-write memory} be a feasible hardware architecture?

That is, assuming you know many, many cycles ahead what you'll need to have loaded at a specific time.

Or hell, maybe it's time to go back to memory bank switching.
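As a software analogy for the question above (not a claim about any real hardware), here is a small Python sketch of prefetching when the access schedule is known ahead of time. The names `run_with_prefetch`, `slow_store`, and `fast_slots` are hypothetical, and the loop runs sequentially, so it only counts how often a block was missing from the small fast buffer when it was needed. Real hardware would overlap the load with compute.

```python
from collections import deque


def run_with_prefetch(schedule, slow_store, fast_slots=2):
    """Process blocks in `schedule` order, loading the next block into a
    small fast buffer right after the current one has been used."""
    fast = {}        # stand-in for the small, fast memory
    order = deque()  # load order, oldest first (FIFO eviction)
    stalls = 0       # times a needed block was not already in fast memory
    results = []

    def load(block_id):
        if block_id in fast:
            return
        if len(fast) >= fast_slots:
            # Evict the oldest loaded block to make room.
            fast.pop(order.popleft())
        fast[block_id] = slow_store[block_id]
        order.append(block_id)

    for step, block_id in enumerate(schedule):
        if block_id not in fast:
            # Demand load: the "compute" would have to wait here.
            stalls += 1
            load(block_id)
        # "Compute" on the current block (a simple sum as a placeholder).
        results.append(sum(fast[block_id]))
        # The schedule is known, so fetch the next block ahead of time.
        if step + 1 < len(schedule):
            load(schedule[step + 1])
    return results, stalls


# Hypothetical data: the large, slow store holds every block.
slow_store = {"a": [1, 2], "b": [3], "c": [4, 5]}
results, stalls = run_with_prefetch(["a", "b", "a", "c"], slow_store)
```

With a known schedule, only the very first block causes a stall in this toy run. Every later block is already sitting in the fast buffer by the time it is needed.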