| HN Mirror



	by veselin 334 days ago
	Does anybody know of an inference provider that offers input token caching? It seems almost required for agentic use: first for speed, but also because almost every conversation starts where the previous one ended, so cost can end up considerably higher without caching. I would have expected good providers like Together, Fireworks, etc. to support it, but I can't find it, unless I run vLLM myself on self-hosted instances.
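To make the cost argument concrete, here is a rough back-of-the-envelope sketch. It assumes every turn resends the whole history as the prompt, and that a provider with caching bills the already-seen prefix at a discounted rate. The function name, the price, and the 10% cached rate are made-up illustration values, not any provider's actual pricing.

```python
def input_cost(turn_tokens, price_per_mtok, cached_rate=None):
    """Estimate total input cost of a multi-turn conversation.

    turn_tokens: tokens newly appended to the history on each turn.
    price_per_mtok: hypothetical price per million uncached input tokens.
    cached_rate: fraction of the full price charged for cached prefix
        tokens (e.g. 0.1), or None if the provider has no caching.
    """
    history = 0          # tokens already sent on earlier turns
    full_price = 0       # tokens billed at the full rate
    cached = 0           # tokens billed at the cached rate
    for new in turn_tokens:
        if cached_rate is None:
            # No caching: the entire prompt is billed at full price.
            full_price += history + new
        else:
            # Caching: only the new suffix is billed at full price.
            cached += history
            full_price += new
        history += new
    billed = full_price + cached * (cached_rate or 0.0)
    return billed * price_per_mtok / 1_000_000


# Ten turns of 1,000 new tokens each, at a hypothetical $1 per million tokens.
turns = [1000] * 10
without_cache = input_cost(turns, price_per_mtok=1.0)
with_cache = input_cost(turns, price_per_mtok=1.0, cached_rate=0.1)
print(f"no cache: ${without_cache:.4f}, with cache: ${with_cache:.4f}")
```

Without caching the billed input grows quadratically with the number of turns, which is why the gap widens quickly for long agentic sessions. Output tokens and any cache-write surcharge are ignored here.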

2 comments

Alibaba Cloud does:

> Supported models. Currently, qwen-max, qwen-plus, qwen-turbo, qwen3-coder-plus support context cache.

I know. I cannot believe that LM Studio, Ollama, and especially model providers do not offer this yet.