Hacker News
by punkpeye 426 days ago
What I notice with every new Gemini model is that the time to first token (TTFT) is not great. My guess is that they gradually shift compute capacity from old models to new ones as demand increases.
1 comment

If you’re imagining that 2.5 Pro gets dynamically loaded during the time to first token, then you’re vastly overestimating what’s physically possible.

It’s more likely a latency-throughput tradeoff. Your query might get put inside a large batch, for example.
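
To make the batching point concrete, here is a toy sketch in Python. It is only an illustration under assumptions of my own, not a description of how Gemini is actually served: requests arrive every 10 ms, a batch is dispatched the moment it is full, and compute time is ignored. The function name `queue_delays` and the numbers are made up for the example.

```python
def queue_delays(arrival_times, batch_size):
    """Return how long each request waits before its batch is dispatched.

    Assumption: a batch is dispatched the moment it is full; a trailing
    partial batch is dispatched when its last request arrives.
    """
    delays = []
    for start in range(0, len(arrival_times), batch_size):
        batch = arrival_times[start:start + batch_size]
        dispatch_time = batch[-1]  # the batch leaves when its last member arrives
        delays.extend(dispatch_time - t for t in batch)
    return delays


# Hypothetical load: one request every 10 ms (times in milliseconds).
arrivals = [i * 10 for i in range(8)]

for batch_size in (1, 4, 8):
    delays = queue_delays(arrivals, batch_size)
    mean_delay = sum(delays) / len(delays)
    print(f"batch_size={batch_size}: mean queue delay = {mean_delay} ms")
```

In this model the mean wait before any work starts grows with the batch size (0, 15, and 35 ms here), while each dispatch serves more requests at once. That wait shows up directly in TTFT, which is the tradeoff I mean.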