

by joelm 1125 days ago

Latency has been the biggest challenge for me.

They cite "two to 15+ seconds" in this blog post for responses. Via the OpenAI API I've been seeing more like 45-60 seconds for responses (using GPT-3.5-turbo or GPT-4 in chat mode). Note, this is using ~3500 tokens total.

I've had to extensively adapt to that latency in the UI of our product. Maybe I should start showing funny messages while the user is waiting (like I've seen porkbun do when you pay for domain names).

2 comments

phillipcarter 1125 days ago

Was this in the past week? We had much worse latency this past week than in previous weeks (in addition to model unavailability errors), which we attributed to the Microsoft Build conference. One of our customers that uses it a lot is always at the token limit, and their average latency was ~5 seconds, but that was closer to 10 seconds last week.

...also why we can't wait for other vendors to get SOC I/II clearance, and I guess eventually fine-tuning our own model, so we're not stuck with situations like this.

joelm 1125 days ago

I've seen more errors lately, I think, but no, the latency has been an issue for months. I think it has grown some over the last few months, but not a dramatic change.

phillipcarter 1125 days ago

Well poop, hope that gets resolved fast. I guess OpenAI can't hire compute platform engineers fast enough!

kristjansson 1125 days ago

If a user is waiting on the response, you basically have to stream the result instead of waiting on the entire completion.
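
A minimal sketch of the idea, not tied to any particular client library: `fake_completion_stream` is a hypothetical stand-in for a streaming API response, and `render_streaming` is an illustrative helper that hands each chunk to the UI as soon as it arrives. The total time is unchanged; the user just sees the first words much sooner.

```python
import time
from typing import Callable, Iterator, List


def fake_completion_stream(text: str, delay_s: float = 0.0) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response: yields one chunk at a time."""
    for word in text.split(" "):
        time.sleep(delay_s)  # simulates per-chunk generation latency
        yield word + " "


def render_streaming(chunks: Iterator[str], on_chunk: Callable[[str], None]) -> str:
    """Pass each chunk to the UI callback as it arrives; return the full text at the end."""
    parts: List[str] = []
    for chunk in chunks:
        on_chunk(chunk)  # e.g. append to a text box instead of waiting for the whole reply
        parts.append(chunk)
    return "".join(parts).rstrip()


shown: List[str] = []
full = render_streaming(
    fake_completion_stream("Latency feels lower when text appears early"),
    shown.append,
)
```

With a real API you would swap the fake generator for whatever chunk iterator the client exposes in streaming mode and keep the rendering loop the same.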

ukuina 1124 days ago

There's no real benefit to streaming if you are planning to use the LLM output downstream (say, in a SQL query). LLM latency is a major annoyance right now, whether locally-hosted or cloud-based.
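
To illustrate why, here is a rough sketch under assumed names (`collect_sql` and its sanity check are hypothetical, not from any library): even if the response arrives as a stream, the consumer has to buffer every chunk before the query can be checked or executed, so the wait is the full completion time either way.

```python
from typing import Iterable


def collect_sql(chunks: Iterable[str]) -> str:
    """Buffer every chunk; the query is only usable once the stream has ended."""
    sql = "".join(chunks).strip()
    # Hypothetical sanity check: accept only a single SELECT statement.
    if not sql.lower().startswith("select") or ";" in sql.rstrip(";"):
        raise ValueError(f"unexpected model output: {sql!r}")
    return sql


# Chunks as they might arrive from a streamed response (made-up example data).
chunks = ["SELECT name ", "FROM users ", "WHERE id = 1;"]
query = collect_sql(chunks)
```

Nothing downstream can start until the last chunk is in, which is the point being made: streaming only helps when a human is reading along.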

joelm 1125 days ago

Yea, that is probably a better solution. Not an easy one to refactor into at the moment though.
