Hacker News new | ask | show | jobs
by Tostino 655 days ago
I hope some of the opensource inference servers start supporting that endpoint soon. I know vLLM has added some "offline batch mode" support with the same format, they just haven't gotten around to implementing it on the OpenAI endpoint yet.
2 comments

Do note it can take up to 24 hours or drop requests altogether. But if that’s not an issue for your use case it’s a great cost saving.
This is neat, I’ve been looking for a way to run our analytics (LLM-based) without affecting the rate limits of our prod app.

May need to give this a try!

What percentage of requests usually get dropped? Is it something miniscule like 1% or are we talking non trivial like 10%
llama.cpp enabled continuous batching by default half a year ago: https://github.com/ggerganov/llama.cpp/pull/6231

There is no need for a new API endpoint. Just send multiple requests at once.

The point of the endpoint is to be able to standardize my codebase and have an agnostic LLM provider that works the same.

Continuous batching is helpful for this type of thing, but it really isn't everything you need. You'd ideally maintain a low priority queue for the batch endpoint and a high priority queue for your real-time chat/completions endpoint.

Would allow utilizing your hardware much better.