Hacker News
by joelm 1125 days ago
Latency has been the biggest challenge for me.

They cite "two to 15+ seconds" in this blog post for responses. Via the OpenAI API I've been seeing more like 45-60 seconds for responses (using GPT-3.5-turbo or GPT-4 in chat mode). Note, this is using ~3500 tokens total.

I've had to extensively adapt to that latency in the UI of our product. Maybe I should start showing funny messages while the user is waiting (like I've seen porkbun do when you pay for domain names).

2 comments

Was this in the past week? We had much worse latency this past week compared to the rest (in addition to model unavailability errors), which we attributed to the Microsoft Build conference. One of our customers that uses it a lot is always at the token limit and their average latency was ~5 seconds, but that was closer to 10 seconds last week.

...also why we can't wait for other vendors to get SOC I/II clearance, and I guess eventually fine-tuning our own model, so we're not stuck with situations like this.

I've seen more errors lately, I think, but no, the latency has been an issue for months. I think it has grown some over the last few months, but not a dramatic change.
Well poop, hope that gets resolved fast. I guess OpenAI can't hire compute platform engineers fast enough!
If a user is waiting on the response, you basically have to stream the result instead of waiting on the entire completion.
There's no real benefit to streaming if you are planning to use the LLM output downstream (say, in a SQL query). LLM latency is a major annoyance right now, whether locally-hosted or cloud-based.
Yea, that is probably a better solution. Not an easy one to refactor into at the moment though.
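For anyone weighing that refactor, here is a rough sketch of the streaming idea suggested above: hand each chunk to the UI as it arrives instead of blocking on the whole completion. This is only an illustration under stated assumptions. `fake_token_stream` and `render_stream` are hypothetical names, and the fake stream stands in for whatever chunk iterator your LLM client actually returns when streaming is enabled.

```python
import time
from typing import Callable, Iterable, Iterator


def fake_token_stream(text: str, delay_s: float = 0.0) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response: yields one word at a time."""
    for word in text.split(" "):
        time.sleep(delay_s)  # simulated per-chunk generation latency
        yield word + " "


def render_stream(chunks: Iterable[str], on_chunk: Callable[[str], None]) -> str:
    """Pass each chunk to the UI callback as it arrives; return the full text at the end."""
    parts = []
    for chunk in chunks:
        on_chunk(chunk)  # the user sees partial output immediately
        parts.append(chunk)
    return "".join(parts).rstrip()


if __name__ == "__main__":
    # The first word is printed after one delay, not after the whole response.
    full_text = render_stream(
        fake_token_stream("The answer is forty two", delay_s=0.05),
        on_chunk=lambda c: print(c, end="", flush=True),
    )
    print()
```

Total generation time is unchanged here; only the wait before the first visible output shrinks. Code that needs the complete output (the SQL case mentioned above) still has to wait for the return value of `render_stream`, so streaming gains nothing there.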