| HN Mirror

by littlestymaar 512 days ago

> I’m not an expert on at-scale inference, but they surely can’t have been running at a batch size of more than 1 if they were getting performance that bad on 4xH100… and I’m not even sure how they were getting performance that low even at batch size 1. Batching is essential to serving large token volumes at scale.

That was my first thought as well, but from a quick search it looks like llama.cpp has a fairly high default batch size (something like 256 or 512, I don't remember exactly, which I find surprising for something that's mostly used by local users), so that shouldn't be the issue.

> As the comments on reddit said, those numbers don’t make sense.

Absolutely, hence my question!

1 comment

coder543 512 days ago

Sure, but that default batch size would only matter if the person in question was actually generating and measuring parallel requests, not just measuring the straight-line performance of sequential requests... and I have no confidence they were.
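
To make the sequential vs. parallel distinction concrete, here is a rough sketch of a toy model, not a real benchmark and not how llama.cpp actually schedules work. It assumes a decode step costs the same wall time no matter how many sequences share it (up to the batch limit). The function name, the 20 ms step time, and the request counts are all made up for illustration.

```python
import math

def simulated_throughput(num_requests, tokens_per_request, concurrency,
                         step_time_s=0.02, max_batch=256):
    """Tokens/second under a toy model (assumption): one decode step takes
    step_time_s regardless of how many sequences (<= max_batch) are in it."""
    # The server can only batch what the client actually sends at once.
    in_flight = min(concurrency, max_batch)
    # Requests are served in waves of `in_flight` sequences each.
    waves = math.ceil(num_requests / in_flight)
    total_time_s = waves * tokens_per_request * step_time_s
    total_tokens = num_requests * tokens_per_request
    return total_tokens / total_time_s

# Same server settings, same total work, different client behaviour.
sequential = simulated_throughput(64, 100, concurrency=1)   # one request at a time
parallel = simulated_throughput(64, 100, concurrency=64)    # 64 requests in flight

print(f"sequential: {sequential:.0f} tok/s")
print(f"parallel:   {parallel:.0f} tok/s")
```

In this model a batch limit of 256 changes nothing for the client that sends one request at a time, which is the point above: a high default only helps if the benchmark actually keeps multiple requests in flight.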
