


	by jiqiren 148 days ago
	This release introduces parallel requests with continuous batching for high-throughput serving, an all-new non-GUI deployment option, a new stateful REST API, and a refreshed user interface.

2 comments

observationist 148 days ago

Awesome - having the API, MCP integrations, and a refined CLI gives you everything you might want. I have some things I'd wanted to try with ChainForge and LMStudio that are now almost trivial.

Thanks for the updates!


nubg 148 days ago

Are parallel requests "free"? Or do you halve performance when sending two requests in parallel?


anon373839 148 days ago

I have seen ~1,300 tokens/sec of total throughput with Llama 3 8B on a MacBook Pro. So no, you don't halve the performance. But running batched inference takes more memory, so you have to use shorter contexts than if you weren't batching.
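
To put rough numbers on the memory side, here is a back-of-the-envelope sketch of KV-cache sizing. It assumes the commonly cited Llama 3 8B shape (32 layers, 8 KV heads, head dimension 128), an fp16 cache, and a hypothetical 4 GiB cache budget split evenly across requests. Real runtimes may quantize the cache or allocate it differently, so treat the output as illustrative only.

```python
# Rough KV-cache sizing. The architecture numbers are assumptions for
# Llama 3 8B; the 4 GiB budget below is a made-up example value.
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # fp16 cache assumed


def kv_bytes_per_token() -> int:
    """Bytes of KV cache needed to hold one token of context."""
    # Factor 2: one key vector and one value vector per layer per KV head.
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def max_context_per_request(budget_bytes: int, parallel_requests: int) -> int:
    """Tokens of context each request gets if the budget is split evenly."""
    return budget_bytes // (kv_bytes_per_token() * parallel_requests)


if __name__ == "__main__":
    budget = 4 * 1024**3  # hypothetical 4 GiB reserved for the KV cache
    for n in (1, 2, 4, 8):
        print(f"{n} parallel request(s): "
              f"{max_context_per_request(budget, n)} tokens each")
```

Under these assumptions each token costs 128 KiB of cache, so doubling the number of parallel requests halves the context available to each one.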
