
by coder543 899 days ago

> These data center targeted GPUs can only output that many tokens per second for large batches.

No… my RTX 3090 can output 130 tokens per second with Mistral on batch size 1. A more powerful GPU (with faster memory) should easily be able to crack 200 tokens per second at batch size 1 with Mistral.

At larger batch sizes, the token rate would be enormous.

Microsoft’s high-performing Phi-2 model breaks 200 tokens per second at batch size 1 on my RTX 3090. TinyLlama-1.1B runs at 350 tokens per second, though its usefulness may be questionable.

We’re just used to datacenter GPUs being used for much larger models, which are much slower, and cannot fit on today’s phones.

Const-me 899 days ago

I wonder, are you using a quantized version of Mistral? The NVIDIA RTX 3090 has 936 GB/second of memory bandwidth, so 130 tokens/second works out to 7.2 GB per token. In the original 16-bit format, the model takes about 13GB.
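For anyone who wants to check that arithmetic, here is a rough sketch. It assumes decoding at batch size 1 is purely memory-bandwidth bound, so every generated token streams all the weights from VRAM once. The function names are made up for illustration, and the 7.0B parameter count is a rounded assumption, which is why the 16-bit figure comes out a little different from the 13GB quoted above.

```python
# Back-of-the-envelope check, assuming each generated token requires
# reading every weight from VRAM exactly once (memory-bound decoding).

def gb_read_per_token(bandwidth_gb_s: float, tokens_per_s: float) -> float:
    """GB of memory traffic available per generated token."""
    return bandwidth_gb_s / tokens_per_s

def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB, ignoring KV cache and runtime overhead."""
    return params_billions * bits_per_weight / 8

BANDWIDTH_GB_S = 936.0   # RTX 3090 figure quoted in the thread
TOKENS_PER_S = 130.0     # batch-size-1 Mistral rate reported upthread
PARAMS_B = 7.0           # assumption: "7B" taken literally

budget = gb_read_per_token(BANDWIDTH_GB_S, TOKENS_PER_S)
fp16 = weights_gb(PARAMS_B, 16)
q4 = weights_gb(PARAMS_B, 4)

print(f"budget per token: {budget:.1f} GB")
print(f"16-bit weights:   {fp16:.1f} GB (fits budget: {fp16 <= budget})")
print(f"4-bit weights:    {q4:.1f} GB (fits budget: {q4 <= budget})")
```

Under those assumptions the 16-bit weights do not fit in the per-token budget and the 4-bit weights do, which is why the reported rate points to a quantized model.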

Anyway, while these datacenter servers can deliver these speeds for a single session, they don’t do that because large batches result in much higher combined throughput.

coder543 899 days ago

> I wonder are you using a quantized version of Mistral?

Yes, we’re comparing phone performance versus datacenter GPUs. That is the discussion point I was responding to originally. That person appeared to be asking when phones are going to be faster than datacenters at running these models. Phones are not running un-quantized 7B models. I was using the 4-bit quantized models, which are close to what phones would be able to run, and a very good balance of accuracy vs speed.

> Anyway, while these datacenter servers can deliver these speeds for a single session, they don’t do that because large batches result in much higher combined throughput.

I don’t agree… batching will increase latency slightly, but it shouldn’t affect throughput for a single session much if it is done correctly. I admit it probably will have some effect, of course. The point of batching is to make use of the unused compute resources, balancing compute vs memory bandwidth better. You should still be running through the layers as fast as memory bandwidth allows, not stalling on compute by making the batch size too large. Right?
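A toy roofline-style model makes this argument concrete. This is only a sketch under loud assumptions: one decode step costs the larger of "read all weights once" and "do the math for every session in the batch", KV-cache traffic and scheduling overhead are ignored, and `COMPUTE_GFLOP_S` is a hypothetical sustained rate chosen for illustration, not a measured or published figure for any GPU.

```python
# Toy model: a decode step emits one token per session in the batch.
# The step takes as long as the slower of memory traffic and compute.

def step_time_s(batch, weights_gb, bandwidth_gb_s, gflop_per_token, compute_gflop_s):
    memory_time = weights_gb / bandwidth_gb_s                 # weights read once per step
    compute_time = batch * gflop_per_token / compute_gflop_s  # grows with batch size
    return max(memory_time, compute_time)

WEIGHTS_GB = 3.5            # assumption: 7B parameters at 4 bits
BANDWIDTH_GB_S = 936.0      # RTX 3090 figure quoted in the thread
GFLOP_PER_TOKEN = 14.0      # assumption: ~2 FLOPs per parameter per token
COMPUTE_GFLOP_S = 70_000.0  # hypothetical sustained compute rate

def rates(batch):
    """Return (tokens/s for one session, tokens/s for the whole server)."""
    t = step_time_s(batch, WEIGHTS_GB, BANDWIDTH_GB_S, GFLOP_PER_TOKEN, COMPUTE_GFLOP_S)
    return 1.0 / t, batch / t

for batch in (1, 8, 16, 32, 64):
    per_session, total = rates(batch)
    print(f"batch={batch:3d}  per-session={per_session:7.1f} tok/s  server={total:8.1f} tok/s")
```

In this model the per-session rate stays flat while the batch is small enough to remain memory-bound, and only starts dropping once compute becomes the bottleneck. Real servers will deviate from this, but it is the shape of the claim being made here.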

We don’t see these speeds because datacenter GPUs are running much larger models, as I have said repeatedly. Even GPT-3.5 Turbo is huge by comparison, since it is believed to be 20B parameters. It would run at about a third of the speed of Mistral. But, GPT-4 is where things get really useful, and no one knows (publicly) just how huge that is. It is definitely a lot slower than GPT-3.5, which in turn is a lot slower than Mistral.
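The "about a third" figure follows from a simple proportionality, sketched here. It assumes both models are memory-bandwidth bound on the same hardware at the same bits per weight, so the token rate scales inversely with parameter count. The 20B number is the rumored size mentioned above, not a confirmed one, and `relative_speed` is a name made up for this example.

```python
def relative_speed(small_params_b: float, large_params_b: float) -> float:
    """Speed of the larger model as a fraction of the smaller one,
    assuming token rate is inversely proportional to parameter count."""
    return small_params_b / large_params_b

# 7B for Mistral, 20B as the rumored GPT-3.5 Turbo size from the comment.
ratio = relative_speed(7.0, 20.0)
print(f"estimated speed relative to Mistral: {ratio:.2f}x")
```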

Const-me 899 days ago

People use batching on servers to optimize throughput for the complete server, not for a single session.

See the “throughput (tokens/s) versus concurrency” graph in this article: https://www.predera.com/blog/mistral-7b-performance-analysis...

There are other interesting graphs there; they also measured latency. They found a very strong dependency between batch size and latency, both for the first token (i.e. pre-fill) and for the time between subsequent tokens. Note how batch size = 40 delivers the best throughput in tokens/second for the server, but the first output token takes almost 4 seconds to generate, which is probably too slow for an interactive chat.

BTW, I used the browser’s developer tools to measure latency for the free ChatGPT 3.5, and got about 900 milliseconds until the first token. OpenAI probably balanced throughput versus latency very carefully because their user base is large, and that balance directly affects their costs.

coder543 899 days ago

The chart you pointed out is very interesting, but it largely supports my point.

The blue line is easiest to read, so let’s look at how the tokens/sec scale for a single user session as the batch size increases. It starts out at about 100 tokens/s for 5 users = 20 tokens/s/user. At the next point, it is about 19t/s/u. Beyond this point, we start losing some ground, but even by the final data point, it is still over 11t/s/u.

The throughput is affected by less than 2x even with the most unreasonably large batch size. (Unreasonable, because the time to first token is unacceptable for an interactive chat, as you pointed out.)
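The per-user numbers are just aggregate throughput divided by concurrency. Here is a minimal sketch using only the figures read off the chart in this comment; `per_user_rate` is a made-up helper name, and the final point is known only as "over 11", so it is treated as a lower bound.

```python
def per_user_rate(total_tokens_per_s: float, concurrent_users: int) -> float:
    """Tokens per second seen by each session when the server's
    aggregate throughput is shared evenly across concurrent users."""
    return total_tokens_per_s / concurrent_users

first_point = per_user_rate(100.0, 5)  # about 100 tokens/s across 5 users
final_point_lower_bound = 11.0         # "still over 11 t/s/u" at the largest batch

# Because the final point is a lower bound, this ratio is an upper bound.
worst_case_slowdown = first_point / final_point_lower_bound

print(f"first point: {first_point:.0f} tok/s/user")
print(f"worst-case slowdown: under {worst_case_slowdown:.2f}x")
```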

But, with a batch size that is balanced appropriately, the throughput for a single user session is effectively unchanged whether the service is batching at N=5 or N=10. (Or presumably N=1, but the chart doesn’t include that.) The time to first token is also a reasonable 1 second delay, which is similar to what OpenAI is providing in your testing.

So, with the right batching balance, batching increases the total throughput of the server, but does not affect the throughput or latency for any individual session very much. It does have some impact, of course. Model size and quantization seem to have a much larger impact than batching, from an end user standpoint.
