| HN Mirror


	by lostmsu 566 days ago
	This resource looks very bad to me, as they don't check batched inference at all. That might make sense now, when most people are just running a single query at a time, but pretty soon almost everything will be running queries in parallel to take advantage of the compute.

1 comments

menaerus 566 days ago

How do you run multiple queries from multiple clients simultaneously on the same hardware without them affecting each other's context?

lostmsu 566 days ago

It depends on the framework. Here's an LLamaSharp example: https://github.com/SciSharp/LLamaSharp/blob/master/LLama.Exa...

menaerus 566 days ago

My question wasn't about how to run multiple queries against the LLM, but rather how it is even possible, from a transformer-architecture point of view, for a single LLM to serve multiple different end clients. I'm probably missing something, but I can't figure it out yet.

lostmsu 566 days ago

If you have a branchless program, you can execute the same step of the program on multiple different inputs in parallel. https://en.wikipedia.org/wiki/SIMD
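
To make that concrete, here is a minimal sketch in plain Python. It is not how any real inference framework is implemented: the 2x2 `WEIGHTS` matrix and the `step` and `batched_step` functions are hypothetical stand-ins for a model layer. The point is only that the weights are shared and read-only, while each client's state lives in its own row, so one client's input cannot leak into another's output.

```python
import math

# Shared, read-only "model weights" (a hypothetical 2x2 toy layer).
WEIGHTS = [[0.5, -0.25], [0.1, 0.3]]

def step(state):
    """One branchless step: the same arithmetic runs for any input vector."""
    return [math.tanh(sum(w * x for w, x in zip(row, state))) for row in WEIGHTS]

def batched_step(states):
    """Apply the same step to every client's state. Rows never mix.

    Real hardware would process the rows in parallel lanes; this loop
    only models the data flow, not the speed-up.
    """
    return [step(s) for s in states]

# Three independent clients, each with its own state (its own "context").
clients = [[1.0, 2.0], [-3.0, 0.5], [0.0, 4.0]]
batched = batched_step(clients)

# Each client gets the same result it would get running alone.
assert batched == [step(s) for s in clients]

# Changing client 1's input leaves clients 0 and 2 untouched.
changed = batched_step([clients[0], [9.0, 9.0], clients[2]])
assert changed[0] == batched[0] and changed[2] == batched[2]
assert changed[1] != batched[1]
```

In this toy model the only thing the clients share is `WEIGHTS`, which is never written to during a step. That is the sense in which batching does not mix contexts.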
