| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by Tostino 655 days ago
	I hope some of the opensource inference servers start supporting that endpoint soon. I know vLLM has added some "offline batch mode" support with the same format, they just haven't gotten around to implementing it on the OpenAI endpoint yet.

2 comments

asaddhamani 655 days ago

Do note it can take up to 24 hours or drop requests altogether. But if that’s not an issue for your use case it’s a great cost saving.

link

jumploops 655 days ago

This is neat, I’ve been looking for a way to run our analytics (LLM-based) without affecting the rate limits of our prod app.

May need to give this a try!

link

altdataseller 655 days ago

What percentage of requests usually get dropped? Is it something miniscule like 1% or are we talking non trivial like 10%

link

johndough 655 days ago

llama.cpp enabled continuous batching by default half a year ago: https://github.com/ggerganov/llama.cpp/pull/6231

There is no need for a new API endpoint. Just send multiple requests at once.

link

Tostino 655 days ago

The point of the endpoint is to be able to standardize my codebase and have an agnostic LLM provider that works the same.

Continuous batching is helpful for this type of thing, but it really isn't everything you need. You'd ideally maintain a low priority queue for the batch endpoint and a high priority queue for your real-time chat/completions endpoint.

Would allow utilizing your hardware much better.

link