Hacker News new | ask | show | jobs
by AaronFriel 881 days ago
Variance would be good, and I've also seen significant variance on "cold" request patterns, which may correspond to resources scaling up on the backend of providers.

Would be interesting to see request latency and throughput when API calls occur cold (first data point), and once per hour, minute, and per second with the first N samples dropped.

Also, at least with Azure OpenAI, the AI safety features (filtering & annotations) make a significant difference in time to first token.