
	by jrvarela56 900 days ago
	In your experience, how could these local LLMs become snappier than using streamed API calls? How far are they if not? How soon do you guess they’ll get there? I understand the motivation includes factors other than performance, I’m just curious about performance as it applies to UX.

2 comments

simonw 900 days ago

Honestly I think being able to run any kind of LLM on a phone is a miracle. I'm astonished at how good (and how fast) Mistral 7B runs under MLC Chat on iOS, considering the constraints of the device.

I don't use it as more than a cool demo though, because the large hosted LLMs (I tend to mostly use GPT-4) are massively more powerful.

But... I'm still intrigued at the idea of a local, slow LLM on my phone enhanced with function calling capabilities, and maybe usable for RAG against private data.

The rate of improvement in these smaller models over the past 6 months has been incredible. We may well find useful applications for them even despite their weaknesses compared to GPT-4 etc.


jallbrit 899 days ago

How do you use GPT-4 frequently given how low the usage cap is?


coder543 900 days ago

What does snappier even mean in this context? The latency from connecting to a server over most network connections isn’t really noticeable when talking about text generation. If the server with a beefy datacenter-class GPU were running the same Mistral you can run on your phone, it would be spitting out hundreds of tokens per second. Most responses would appear on your screen before you blink.

There is no expectation that phones will ever be comparable in performance for LLMs.

Mistral runs at a decent clip on phones, but we’re talking like 11 tokens per second, not hundreds of tokens per second.

Server-based models tend to be only slightly faster than Mistral on my phone because they’re usually running much larger, much more accurate/useful models. Models which currently can’t fit onto phones.

Running models locally is not motivated by performance, except if you’re in places without reliable internet.


Const-me 899 days ago

These data center targeted GPUs can only output that many tokens per second for large batches. These tokens are shared between hundreds or even thousands of users concurrently accessing the same server.

That’s why, even though these GPUs deliver very high throughput in tokens/second, responses do not appear instantly, and individual users observe non-trivial latency.

Another interesting consequence: running these ML models with batch size = 1 (as happens on end-user computers or phones) is practically guaranteed to bottleneck on memory. Compute performance and tensor cores are irrelevant for that use case; the only number that matters is memory bandwidth.

For example, I’ve tested my Mistral implementation on a desktop with an nVidia 1080Ti versus a laptop with the Radeon Vega 7 inside a Ryzen 5 5600U. The performance difference between them is close to 10x, and the reason is memory bandwidth: 484 GB/second for GDDR5X in the desktop versus 50 GB/second for dual-channel DDR4-3200 in the laptop. That is even though theoretical compute performance differs only by a factor of 6.6; the numbers are 10.6 versus 1.6 TFlops.
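
The arithmetic above can be sketched roughly as follows. This assumes the simplest possible model of batch-size-1 decoding, where every weight is read from memory once per generated token, so tokens/second is bounded by bandwidth divided by model size. The function name and the `MODEL_GB` value are hypothetical placeholders, not figures from the comment; only the bandwidth and TFlops numbers are.

```python
def bandwidth_bound_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on batch-size-1 decode speed, assuming all weights
    are streamed from memory once per generated token."""
    return bandwidth_gb_s / model_gb


# Figures quoted in the comment.
DESKTOP_BW_GB_S = 484.0   # 1080Ti, GDDR5X
LAPTOP_BW_GB_S = 50.0     # dual-channel DDR4-3200
DESKTOP_TFLOPS = 10.6
LAPTOP_TFLOPS = 1.6

# Hypothetical model size, only here to make the bound concrete.
# The ratio between the two machines does not depend on it.
MODEL_GB = 4.0

bandwidth_ratio = DESKTOP_BW_GB_S / LAPTOP_BW_GB_S
compute_ratio = DESKTOP_TFLOPS / LAPTOP_TFLOPS

desktop_bound = bandwidth_bound_tokens_per_sec(DESKTOP_BW_GB_S, MODEL_GB)
laptop_bound = bandwidth_bound_tokens_per_sec(LAPTOP_BW_GB_S, MODEL_GB)

print(f"bandwidth ratio: {bandwidth_ratio:.2f}x")
print(f"compute ratio:   {compute_ratio:.2f}x")
print(f"predicted speed ratio if memory-bound: {desktop_bound / laptop_bound:.2f}x")
```

The bandwidth ratio (about 9.7x) matches the observed "close to 10x" much better than the compute ratio (about 6.6x), which is the point of the comparison.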


coder543 899 days ago

> These data center targeted GPUs can only output that many tokens per second for large batches.

No… my RTX 3090 can output 130 tokens per second with Mistral on batch size 1. A more powerful GPU (with faster memory) should easily be able to crack 200 tokens per second at batch size 1 with Mistral.

At larger batch sizes, the token rate would be enormous.

Microsoft’s high performing Phi-2 model breaks 200 tokens per second on batch size 1 on my RTX 3090. TinyLlama-1.1B is 350 tokens per second, though its usefulness may be questionable.

We’re just used to datacenter GPUs being used for much larger models, which are much slower, and cannot fit on today’s phones.


Const-me 899 days ago

I wonder are you using a quantized version of Mistral? NVidia 3090 has 936 GB/second memory bandwidth, so 130 tokens/second works out to at most 7.2 GB read per token. In the original 16-bit format, the model takes about 13 GB.

Anyway, while these datacenter servers can deliver these speeds for a single session, they don’t do that because large batches result in much higher combined throughput.
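
As a quick check of that inference, here is a minimal sketch using only numbers quoted in this thread (936 GB/second, the 130 tokens/second reported above, and about 13 GB for 16-bit weights). It assumes decoding is purely memory-bound at batch size 1; the function and variable names are made up for illustration.

```python
def gb_read_per_token(bandwidth_gb_s: float, tokens_per_sec: float) -> float:
    """Most data the GPU could have read per token at the given speed."""
    return bandwidth_gb_s / tokens_per_sec


RTX_3090_BW_GB_S = 936.0   # quoted in the comment above
REPORTED_TOKENS_S = 130.0  # batch-size-1 figure reported earlier in the thread
FP16_MODEL_GB = 13.0       # approximate 16-bit Mistral size, from the comment

budget_gb = gb_read_per_token(RTX_3090_BW_GB_S, REPORTED_TOKENS_S)

# If all 13 GB had to be read per token, this is the fastest possible speed.
fp16_ceiling_tokens_s = RTX_3090_BW_GB_S / FP16_MODEL_GB

# Reported speed above the 16-bit ceiling implies smaller (quantized) weights.
likely_quantized = REPORTED_TOKENS_S > fp16_ceiling_tokens_s

print(f"memory budget per token: {budget_gb:.1f} GB")
print(f"16-bit ceiling: {fp16_ceiling_tokens_s:.0f} tokens/s")
print(f"likely quantized: {likely_quantized}")
```

The budget comes out to 7.2 GB per token and the 16-bit ceiling to 72 tokens/second, so 130 tokens/second is only plausible if the weights are well under 13 GB.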


coder543 899 days ago

> I wonder are you using a quantized version of Mistral?

Yes, we’re comparing phone performance versus datacenter GPUs. That is the discussion point I was responding to originally. That person appeared to be asking when phones are going to be faster than datacenters at running these models. Phones are not running un-quantized 7B models. I was using the 4-bit quantized models, which are close to what phones would be able to run, and a very good balance of accuracy vs speed.

> Anyway, while these datacenter servers can deliver these speeds for a single session, they don’t do that because large batches result in much higher combined throughput.

I don’t agree… batching will increase latency slightly, but it shouldn’t affect throughput for a single session much if it is done correctly. I admit it probably will have some effect, of course. The point of batching is to make use of the unused compute resources, balancing compute vs memory bandwidth better. You should still be running through the layers as fast as memory bandwidth allows, not stalling on compute by making the batch size too large. Right?
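
To make that argument concrete, here is a toy roofline-style sketch. It is a simplification, not a description of any real serving stack: it assumes one decode step costs the larger of (time to stream the weights once) and (time to do about 2 FLOPs per parameter per sequence in the batch), and it ignores KV-cache reads, pre-fill, and scheduling. The 936 GB/second figure is from this thread; the peak TFLOPS, weight size, and parameter count are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    bandwidth_gb_s: float
    peak_tflops: float


def step_seconds(gpu: Gpu, weights_gb: float, params_billion: float, batch: int) -> float:
    """Time for one decode step that emits one token per sequence in the batch."""
    memory_s = weights_gb / gpu.bandwidth_gb_s
    # Rough rule of thumb: about 2 FLOPs per parameter per generated token.
    compute_s = batch * 2.0 * params_billion * 1e9 / (gpu.peak_tflops * 1e12)
    return max(memory_s, compute_s)


def per_session_tokens_s(gpu: Gpu, weights_gb: float, params_billion: float, batch: int) -> float:
    return 1.0 / step_seconds(gpu, weights_gb, params_billion, batch)


def aggregate_tokens_s(gpu: Gpu, weights_gb: float, params_billion: float, batch: int) -> float:
    return batch / step_seconds(gpu, weights_gb, params_billion, batch)


# Bandwidth from the thread; peak_tflops is a hypothetical placeholder.
gpu = Gpu(bandwidth_gb_s=936.0, peak_tflops=70.0)
WEIGHTS_GB = 4.0       # hypothetical 4-bit 7B model
PARAMS_BILLION = 7.0   # hypothetical

for batch in (1, 16, 64):
    single = per_session_tokens_s(gpu, WEIGHTS_GB, PARAMS_BILLION, batch)
    total = aggregate_tokens_s(gpu, WEIGHTS_GB, PARAMS_BILLION, batch)
    print(f"batch={batch:3d}  per-session={single:7.1f} tok/s  aggregate={total:8.1f} tok/s")
```

Under these assumptions, per-session speed stays flat while the step is memory-bound and only drops once the batch is large enough to make it compute-bound. Real servers also pay for KV-cache traffic and pre-fill, which this sketch leaves out.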

We don’t see these speeds because datacenter GPUs are running much larger models, as I have said repeatedly. Even GPT-3.5 Turbo is huge by comparison, since it is believed to be 20B parameters. It would run at about a third of the speed of Mistral. But, GPT-4 is where things get really useful, and no one knows (publicly) just how huge that is. It is definitely a lot slower than GPT-3.5, which in turn is a lot slower than Mistral.


Const-me 899 days ago

People use batching on servers to optimize throughput for the complete server, not for a single session.

See “throughput (tokens/s) versus concurrency” graph in that article: https://www.predera.com/blog/mistral-7b-performance-analysis...

There are other interesting graphs there; they also measured the latency. They found a very strong dependency between batch size and latency, both for the first token (i.e. pre-fill) and for the time between subsequent tokens. Note how batch size = 40 delivers the best throughput in tokens/second for the server; however, the first output token takes almost 4 seconds to generate, which is probably too slow for an interactive chat.

BTW, I used development tools in the browser to measure latency for the free ChatGPT 3.5, and got about 900 milliseconds till the first token. OpenAI probably balanced throughput versus latency very carefully because their user base is large, and that balance directly affects their costs.
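
For anyone who wants to take the same kind of measurement in code rather than in browser dev tools, here is a minimal sketch. It times any iterator that yields tokens as they arrive; `fake_stream` is a made-up stand-in for a real streaming client, and its delays are arbitrary, not measurements of any service.

```python
import time
from typing import Iterable, Iterator, Optional


def measure_stream(tokens: Iterable[str]) -> dict:
    """Consume a token stream and report time to first token and mean gap."""
    start = time.perf_counter()
    first: Optional[float] = None
    last: Optional[float] = None
    count = 0
    for _ in tokens:
        now = time.perf_counter()
        if first is None:
            first = now
        last = now
        count += 1
    if first is None or last is None:
        return {"tokens": 0, "ttft_s": None, "mean_gap_s": None}
    # Mean gap is only defined when at least two tokens arrived.
    mean_gap = (last - first) / (count - 1) if count > 1 else None
    return {"tokens": count, "ttft_s": first - start, "mean_gap_s": mean_gap}


def fake_stream(first_delay_s: float, gap_s: float, n: int) -> Iterator[str]:
    """Hypothetical stand-in for a streaming API: slow first token, then steady."""
    time.sleep(first_delay_s)
    yield "tok0"
    for i in range(1, n):
        time.sleep(gap_s)
        yield f"tok{i}"


stats = measure_stream(fake_stream(first_delay_s=0.05, gap_s=0.01, n=5))
print(stats)
```

Separating time to first token from the gap between later tokens matters here because batching affects the two differently, as the graphs in the linked article show.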
