by veunes
194 days ago
The 4x growth in prompt length is a fundamental shift. We've quickly moved from "Q&A" mode to "upload the full context and analyze it" mode. This completely changes infrastructure requirements: KV-cache reuse across requests (prompt caching) becomes a necessity, and prefill time becomes a critical metric, often more important than generation speed. That's exactly why models with cheap long context (Gemini, DeepSeek) are winning the race against "smarter" but more expensive models. Inference economics are now dictated by context length.
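To put rough numbers on the prefill point, here is a back-of-envelope sketch in Python. Every figure in it is a hypothetical assumption, not a measurement of any real model: 32 layers, 8 KV heads of dimension 128 in fp16, prefill at a constant 5,000 tokens/s, and decode at 50 tokens/s per sequence. Treating prefill throughput as constant also ignores attention's quadratic cost, so real long prompts would look worse than this:

```python
# Back-of-envelope sketch: all model/hardware numbers are hypothetical assumptions.

def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2):
    # Factor 2: one key vector and one value vector per layer, per KV head, per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


def request_time_s(prompt_tokens, output_tokens,
                   prefill_tok_per_s=5000.0, decode_tok_per_s=50.0):
    # Prefill processes the whole prompt before the first output token appears;
    # decode then emits output tokens one at a time.
    # Assumes constant prefill throughput (ignores attention's quadratic term).
    prefill = prompt_tokens / prefill_tok_per_s
    decode = output_tokens / decode_tok_per_s
    return prefill, decode


for prompt in (25_000, 100_000):  # a 4x longer prompt
    kv_gib = kv_cache_bytes(prompt) / 2**30
    prefill, decode = request_time_s(prompt, output_tokens=300)
    print(f"prompt={prompt:>7}: KV cache {kv_gib:.2f} GiB, "
          f"prefill {prefill:.1f} s, decode {decode:.1f} s")
```

Under these assumptions, a 4x longer prompt means a 4x larger KV cache per request, and at 100k tokens the prefill (about 20 s) outweighs generating 300 output tokens (about 6 s). Reusing a cached prefix is what lets repeated requests over the same context skip most of that prefill.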