| HN Mirror



	by jdfedgon 1042 days ago
	Nice achievement. How many users would realistically be able to use it at the same time when running on such a device? I am interested in its scalability.

5 comments

jacquesm 1042 days ago

That's a tricky question. You're going to have to multiplex the use of the device, but since these are mostly 'ping-pong' style uses you can use something called a 'utilization factor' to figure out a reasonable upper bound at which you still get an answer to your query in acceptable time. The typical mechanism is an input queue with a single worker that uses the device. The cut-off is when the queue becomes unacceptably long, in which case you would have to throw an error or be content with waiting (possibly much) longer for your answer. This is usually capped by some hard limit on the length of the queue (for instance: available memory) or by the fact that the queue fills up faster than it can empty, even over a complete daily cycle. Once that happens you need more hardware.
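
A minimal sketch of the single-worker input queue described above, assuming a hypothetical `handler` callable that stands in for the device and a `max_pending` hard limit on queue length. The `utilization` helper uses the usual definition (arrival rate times service time); values at or above 1 mean the queue fills faster than it can empty. None of these names come from the project under discussion.

```python
import queue
import threading


def utilization(arrival_rate_per_s: float, service_time_s: float) -> float:
    """Fraction of time the single device is busy (arrival rate x service time)."""
    return arrival_rate_per_s * service_time_s


class SingleWorkerQueue:
    """Bounded input queue drained by one worker that owns the device."""

    def __init__(self, handler, max_pending: int):
        self._handler = handler  # hypothetical stand-in for the inference call
        self._pending = queue.Queue(maxsize=max_pending)
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, prompt):
        """Enqueue a request; raises queue.Full once the hard limit is reached."""
        reply = queue.Queue(maxsize=1)
        self._pending.put_nowait((prompt, reply))
        return reply  # caller blocks on reply.get() for the answer

    def _run(self):
        # One request at a time: the device is never used concurrently.
        while True:
            prompt, reply = self._pending.get()
            reply.put(self._handler(prompt))
```

A caller would do `answer = q.submit("hello").get(timeout=60)` and treat `queue.Full` as the "throw an error" case.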


dekhn 1041 days ago

Actually, many inference systems instead batch all requests within a time period and submit them as a single shot. It increases the average latency but handles more requests per unit time. (At least, this is my understanding of how production serving of expensive models that support batching works.)
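
A rough sketch of time-window batching as described above, assuming requests arrive on a standard `queue.Queue`. The function name `collect_batch` and the parameters `max_batch` and `window_s` are illustrative, not taken from any particular serving system.

```python
import queue
import time


def collect_batch(pending: queue.Queue, max_batch: int, window_s: float) -> list:
    """Wait for one request, then gather more until the window closes or the batch is full."""
    batch = [pending.get()]  # block until at least one request exists
    deadline = time.monotonic() + window_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(pending.get(timeout=remaining))
        except queue.Empty:
            break  # window expired with no further requests
    return batch
```

The worker would then run the model once per returned batch. A longer `window_s` raises average latency but gives larger batches, which is the tradeoff mentioned above.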


jacquesm 1041 days ago

I've done a bunch of optimization for GPU code (in CUDA) and there are typically a few bottlenecks that really matter:

- memory bandwidth

- interconnect bandwidth between the CPU and GPU

- interconnect bandwidth between GPUs

- thermals and power if you're doing a good job of optimizing the rest

I don't see how a batching mechanism would improve on any of those; superficially it looks as though that would make matters worse rather than better. Can you explain where the advantage comes from?


dekhn 1041 days ago

It's a latency vs. throughput tradeoff. I was surprised as well. But most GPUs can do 32 inferences in the same time as they can do 1 inference. They have all the parallel units required and there are significant setup costs that can be amortized since all the inferences share the same model, weights, etc.

https://groq.com/wp-content/uploads/2020/05/GROQP002_V2.2.pd...

The "batching" section of https://docs.nvidia.com/deeplearning/tensorrt/archives/tenso...

https://le.qun.ch/en/blog/2023/05/13/transformer-batching/


jacquesm 1041 days ago

Very interesting, thank you. I will point one of my colleagues who is busy with this stuff to these, and I thank you on his behalf as well. It is exactly the kind of thing they are engaged in.


ColonelPhantom 1041 days ago

I think in the case of LLM inference the main bottleneck is streaming the weights from VRAM to CU/SM/EU (whatever naming your GPU vendor of choice uses).

If you're doing inference on multiple prompts at the same time by batching, you don't spend more time streaming. But each streamed weight gets used for, say, 32 calculations instead of 1, making better use of the GPU's compute resources.
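
A back-of-the-envelope model of that argument, assuming one decoding step costs whichever is larger: streaming all weights once, or doing the arithmetic for every prompt in the batch. It ignores KV-cache traffic, activations, and kernel launch overhead, and every number in `ILLUSTRATIVE_HW` is made up for illustration rather than measured on any device.

```python
def batch_step_time_s(batch_size, weight_bytes, mem_bandwidth_bytes_s,
                      flops_per_token, peak_flops_s):
    """Time for one decoding step over the whole batch under a simple roofline model."""
    stream_time = weight_bytes / mem_bandwidth_bytes_s         # paid once per step
    compute_time = batch_size * flops_per_token / peak_flops_s  # grows with the batch
    return max(stream_time, compute_time)


def tokens_per_second(batch_size, **hw):
    """Aggregate throughput: one token per prompt per step."""
    return batch_size / batch_step_time_s(batch_size, **hw)


# Hypothetical hardware and model figures, chosen only to show the shape of the curve.
ILLUSTRATIVE_HW = dict(
    weight_bytes=4e9,            # 4 GB of weights
    mem_bandwidth_bytes_s=20e9,  # 20 GB/s memory bandwidth
    flops_per_token=8e9,         # 8 GFLOP per generated token
    peak_flops_s=1e12,           # 1 TFLOP/s peak compute
)

for b in (1, 16, 32):
    print(b, round(tokens_per_second(b, **ILLUSTRATIVE_HW), 1))
```

With these assumed figures the step time stays flat until compute catches up with streaming (around a batch of 25), so throughput grows almost linearly with batch size up to that point and flattens afterwards.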


gmiller123456 1042 days ago

"Scalability" and "Single Board Computer" don't really belong in the same sentence. That said, today you can get a refurbished mini PC with a lot more power, for a lot less money than the higher end SBCs. But I didn't see any info on how portable this project is to other hardware.


m00x 1041 days ago

I think the biggest advantage here is that you can run it on the GPU using shared memory, and I'm not sure how widespread that is on mini PCs (at least not on Intel NUCs).

You could run it using OpenVINO on Intel CPUs, but the performance would probably take a hit. It would be a lot easier though since you can just use ggml.


btbuildem 1042 days ago

Given the low cost of the setup, I'd expect this to be a single-user solution. Maybe something enabling better smart home / smart device interactions?


brucethemoose2 1042 days ago

If you can run it as an AI Horde worker, and the home usage is sporadic, you could definitely support more than one person.

Otherwise ~1.5 tokens/s is definitely the minimum you'd want streaming tokens to a single person.


tysam_and 1042 days ago

Not many, since it's slow to begin with.

You'll get a log_2-based scaling efficiency with nearly any batch size increase, subject to some limitations (memory, etc.).

That should be enough at least to roughly sketch it out.
