EDIT: Attempting to converse with any Q4_K_M 7B parameter model on a 15 Pro Max... the phone just melts down. It feels like it is producing about one token per minute. MLC-Chat can handle 7B parameter models just fine even on a 14 Pro Max, which has less RAM, so I think there is an issue here.
EDIT 2: Even using StableLM, I am experiencing a total crash of the app fairly consistently if I chat in one conversation, then start a new conversation and try to chat in that. On a related note, since chat history is saved... I don't think it's necessary to have a confirmation prompt if the user clicks the "new chat" shortcut in the top right of a chat.
-----
That does seem much nicer than MLC Chat. I really like the selection of models and saving of conversations.
Microsoft recently re-licensed Phi-2 to be MIT instead of non-commercial, so I would love to see that in the list of models. Similarly, there is a Dolphin-Phi fine tune.
The topic of discussion here is Mistral-7B v0.2, which is also missing from the model list, unfortunately. There are a few Mistral fine tunes in the list, but obviously not the same thing.
I also wish I could enable performance metrics to see how many tokens/sec the model was running at after each message, and to see how much RAM is being used.
Wow, thanks so much for taking the time to test it out and share such great feedback!
Thrilled about all those developments! More model options as well as link-based GGUF downloads on the way.
On the 7b models: I’m very sorry for the poor experience. I wouldn’t recommend 7b over Q2_K at the moment, unless you’re on a 16GB iPad (or an Apple Silicon Mac!). This needs to be much clearer; as you observed, the consequences can be severe. The larger models, and even 3b Q6_K, can be crash-prone due to memory pressure. Will work on improving handling of low-level out-of-memory errors very soon.
Will also investigate the StableLM crashes, I’m sorry about that! Hopefully TestFlight recorded a trace. Just speculating, but it may be a similar issue to the larger models: the higher-fidelity quant (Q6_K) combined with the context length eventually running out of RAM. Could you give the Q4_K_M a shot? I heard something similar from a friend yesterday, so I’m curious if you have a better time with that; perhaps that’s a more sensible default.
Re: the overly-protective new chat alert, I agree, thanks for the suggestion. I’ll incorporate that into the next build. Can I credit you? Let me know how you’d like for me to refer to you, and I’d be happy to.
Finally, please feel free to email me any further feedback, and thanks again for your time and consideration!
I just checked and MLC Chat is running the 3-bit quantized version of Mistral-7B. It works fine on the 14 Pro Max (6GB RAM) without crashing, and is able to stay resident in memory on the 15 Pro Max (8GB RAM) when switching with another not-too-heavy app. 2-bit quantization just feels like a step too far, but I’ll give it a try.
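For a rough sense of why the bit width matters so much on a 6GB or 8GB phone, here is a minimal back-of-the-envelope sketch. It assumes a round 7 billion parameters and nominal bit widths; real quantized files are somewhat larger because of scales and metadata, and the KV cache and the app itself need RAM on top. The function name `weights_gb` is made up for illustration.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Ignores quantization scales, the KV cache, and runtime overhead,
    so the real footprint is higher than this estimate.
    """
    return params_billions * bits_per_weight / 8


# Nominal bit widths only; actual formats use slightly more bits per weight.
for bits in (16, 4, 3, 2):
    print(f"{bits:>2}-bit 7B model: ~{weights_gb(7.0, bits):.1f} GB of weights")
```

By this estimate, each bit removed from a 7B model frees a bit under 1 GB, which is a large share of what a phone app can realistically use.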
Regarding credit, I definitely don’t need any. Just happy to see someone working on a better LLM app!
FYI, just submitted a new update for review with a few small but hopefully noticeable changes, thanks in no small part to your feedback:
1. StableLM Zephyr 3b Q4_K_M is now the built-in model, replacing the Q6_K variant.
2. More aggressive RAM headroom calculation, with forced fallback to CPU rather than failing to load or crashing.
3. New status indicator for Metal when model is loaded (filled bolt for enabled, vs slashed bolt for disabled.)
4. Metal will now also be enabled for devices with 4GB RAM or less, but only when the selected model can comfortably fit in RAM. Previously, only devices with at least 6GB had Metal enabled.
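The headroom check and CPU fallback in points 2 and 4 could look roughly like the sketch below. This is not the app's actual code: the 50% budget, the `LoadPlan` type, and the `plan_model_load` function are all hypothetical, chosen only to illustrate the "enable Metal only when the model comfortably fits" rule.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class LoadPlan:
    use_gpu: bool
    reason: str


def plan_model_load(model_bytes: int, context_bytes: int,
                    device_ram_bytes: int,
                    headroom_fraction: float = 0.5) -> LoadPlan:
    """Decide whether to enable GPU (Metal) offload or fall back to CPU.

    headroom_fraction is a hypothetical tuning knob: the share of total
    device RAM the model plus its context may occupy before GPU offload
    is disabled.
    """
    budget = int(device_ram_bytes * headroom_fraction)
    needed = model_bytes + context_bytes
    if needed <= budget:
        return LoadPlan(True, "model fits comfortably; GPU enabled")
    return LoadPlan(False, "not enough headroom; falling back to CPU")


# A ~2 GiB 3B model on an 8 GiB device versus a ~4 GiB 7B model on 6 GiB.
print(plan_model_load(2 * GIB, GIB // 2, 8 * GIB))
print(plan_model_load(4 * GIB, GIB, 6 * GIB))
```

Note that the rule depends on the model-to-RAM ratio rather than a fixed RAM cutoff, which is what lets a 4GB device still use Metal with a small enough model.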
The fallback does seem to work! Although the 4-bit 7B models only run at 1 token every several seconds.
I still wish Phi-2, Dolphin Phi-2, and TinyLlama-Chat-v1.0 were available, but I understand you have plans to make it easier to download any model in the future.
edit: I can't reply to you below: Do you have the right app, there's no TestFlight just App Store link - if it's ChatOnMac then it should have a dropdown at the top of the chat room to select a model. If it's empty or otherwise bugged out please let me know what you see in the top menu. It filters the available model presets based on how much RAM you have available, so let me know what specific device you have and I can look into it. Thank you.
The model presets are also configurable by forking the bot and loading your own via GitHub (bots run inside sandboxed hidden webviews inside the app). But this is not ergonomically friendly just yet.
I was excited when I saw this, but I'm having trouble with it (and it looks like I'm not the only one). As others have pointed out, the download link on your site does open TestFlight. I've since deleted that version and installed the official version from the AppStore after revisiting this thread in search of answers.
I now have the full version installed on my iPhone 15 pro, and I have added my OpenAI key, but none of the models I've selected (3.5 Turbo, 4, 4 Turbo) work. My messages in the chat have a red exclamation next to them which opens an error message stating 'Load failed' when clicked. If I click 'Retry Message' the entire app crashes.
Apologies for the rough edges and bad experience - I’ve just soft launched without announcement til this post. I will have a hotfix up soon. Thanks for the report.
> Do you have the right app, there's no TestFlight just App Store link
On chatonmac.com, the "Download on the App Store" button does not link to the App Store for me either - I get a modal titled "Public Beta & Launch Day News" with "Join the TestFlight Beta" and "Launch Day Newsletter Signup Form".
Hello, I like your app and the ethics you push forward. Do you plan to add the ability to request Dall-E 3 images within the chat? I’ve yet to find an app that does that and lets me use my own API key.
In your experience, how could these local LLMs become snappier than using streamed API calls? How far are they if not? How soon do you guess they’ll get there?
I understand the motivation includes factors other than performance, I’m just curious about performance as it applies to UX.
Honestly I think being able to run any kind of LLM on a phone is a miracle. I'm astonished at how good (and how fast) Mistral 7B runs under MLC Chat on iOS, considering the constraints of the device.
I don't use it as more than a cool demo though, because the large hosted LLMs (I tend to mostly use GPT-4) are massively more powerful.
But... I'm still intrigued at the idea of a local, slow LLM on my phone enhanced with function calling capabilities, and maybe usable for RAG against private data.
The rate of improvement in these smaller models over the past 6 months has been incredible. We may well find useful applications for them even despite their weaknesses compared to GPT-4 etc.
What does snappier even mean in this context? The latency from connecting to a server over most network connections isn’t really noticeable when talking about text generation. If the server with a beefy datacenter-class GPU were running the same Mistral you can run on your phone, it would be spitting out hundreds of tokens per second. Most responses would appear on your screen before you blink.
There is no expectation that phones will ever be comparable in performance for LLMs.
Mistral runs at a decent clip on phones, but we’re talking like 11 tokens per second, not hundreds of tokens per second.
Server-based models tend to be only slightly faster than Mistral on my phone because they’re usually running much larger, much more accurate/useful models. Models which currently can’t fit onto phones.
Running models locally is not motivated by performance, except if you’re in places without reliable internet.
These data center targeted GPUs can only output that many tokens per second for large batches. These tokens are shared between hundreds or even thousands of users concurrently accessing the same server.
That’s why, although these GPUs deliver very high throughput in tokens/second, responses do not appear instantly, and individual users observe non-trivial latency.
Another interesting consequence: running these ML models with batch size = 1 (as on end-user computers or phones) is practically guaranteed to bottleneck on memory. Compute performance and tensor cores are irrelevant for this use case; the only number that matters is memory bandwidth.
For example, I’ve tested my Mistral implementation on a desktop with an nVidia 1080Ti versus a laptop with the Radeon Vega 7 inside a Ryzen 5 5600U. The performance difference between them is close to 10x because of memory bandwidth: 484 GB/second for GDDR5X in the desktop versus 50 GB/second for dual-channel DDR4-3200 in the laptop. This is even though theoretical compute performance differs only by a factor of 6.6; the numbers are 10.6 versus 1.6 TFlops.
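The arithmetic behind that comparison can be sketched as follows. It assumes the simplest possible model of batch-size-1 decoding, where every generated token requires reading all the weights from memory once, so speed scales with bandwidth and the model size cancels out of the ratio. The function name and the 13 GB example size are illustrative, not taken from the implementation described above.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on batch-size-1 decode speed, assuming each token
    requires one full pass over the weights in memory."""
    return bandwidth_gb_s / model_gb


MODEL_GB = 13.0  # illustrative size; it cancels out of the ratio below

desktop = max_tokens_per_second(484.0, MODEL_GB)  # 1080Ti, GDDR5X
laptop = max_tokens_per_second(50.0, MODEL_GB)    # dual-channel DDR4-3200

bandwidth_ratio = desktop / laptop
compute_ratio = 10.6 / 1.6  # theoretical TFlops, desktop vs laptop

print(f"bandwidth ratio: {bandwidth_ratio:.1f}x")
print(f"compute ratio:   {compute_ratio:.1f}x")
```

The observed speedup of close to 10x lines up with the bandwidth ratio rather than the compute ratio, which is the point being made.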
> These data center targeted GPUs can only output that many tokens per second for large batches.
No… my RTX 3090 can output 130 tokens per second with Mistral on batch size 1. A more powerful GPU (with faster memory) should easily be able to crack 200 tokens per second at batch size 1 with Mistral.
At larger batch sizes, the token rate would be enormous.
Microsoft’s high performing Phi-2 model breaks 200 tokens per second on batch size 1 on my RTX 3090. TinyLlama-1.1B is 350 tokens per second, though its usefulness may be questionable.
We’re just used to datacenter GPUs being used for much larger models, which are much slower, and cannot fit on today’s phones.
I wonder are you using a quantized version of Mistral? NVidia 3090 has 936 GB/second memory bandwidth, so 130 tokens/second = 7.2 GB per token. In the original 16 bits format, the model takes about 13GB.
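Spelled out, the reasoning is a simple bandwidth budget: if each token needs one full read of the weights, a 13GB model cannot reach the 130 tokens/second reported earlier in the thread. A minimal sketch, using only the figures quoted in this thread (the variable names are made up):

```python
BANDWIDTH_GB_S = 936.0   # RTX 3090 memory bandwidth
OBSERVED_TOK_S = 130.0   # speed reported for Mistral at batch size 1
FP16_MODEL_GB = 13.0     # approximate size of the 16-bit weights

# Most data that can be read per token at the observed speed.
gb_per_token_budget = BANDWIDTH_GB_S / OBSERVED_TOK_S

# Fastest possible speed if all 16-bit weights are read for every token.
fp16_ceiling_tok_s = BANDWIDTH_GB_S / FP16_MODEL_GB

print(f"budget per token: {gb_per_token_budget:.1f} GB")
print(f"16-bit ceiling:   {fp16_ceiling_tok_s:.0f} tokens/second")
print("must be smaller than 16-bit:", gb_per_token_budget < FP16_MODEL_GB)
```

Since the budget per token is well below 13GB, the weights being read must be smaller than the 16-bit model, hence the question about quantization.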
Anyway, while these datacenter servers can deliver these speeds for a single session, they don’t do that because large batches result in much higher combined throughput.
> I wonder are you using a quantized version of Mistral?
Yes, we’re comparing phone performance versus datacenter GPUs. That is the discussion point I was responding to originally. That person appeared to be asking when phones are going to be faster than datacenters at running these models. Phones are not running un-quantized 7B models. I was using the 4-bit quantized models, which are close to what phones would be able to run, and a very good balance of accuracy vs speed.
> Anyway, while these datacenter servers can deliver these speeds for a single session, they don’t do that because large batches result in much higher combined throughput.
I don’t agree… batching will increase latency slightly, but it shouldn’t affect throughput for a single session much if it is done correctly. I admit it probably will have some effect, of course. The point of batching is to make use of the unused compute resources, balancing compute vs memory bandwidth better. You should still be running through the layers as fast as memory bandwidth allows, not stalling on compute by making the batch size too large. Right?
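A toy roofline-style model makes the compute vs memory bandwidth trade-off concrete. Everything here is an assumption for illustration: weights are read once per decoding step and shared by the whole batch, each token costs about 2 FLOPs per parameter, the weights are a hypothetical 4 GB (roughly a 4-bit 7B model), and the 35 TFLOPS peak is a placeholder rather than a measured figure. KV cache traffic, which grows with batch size, is ignored.

```python
def step_time_s(batch: int, weight_bytes: float, flops_per_token: float,
                bandwidth_bytes_s: float, peak_flops: float) -> float:
    """Time for one decoding step that emits one token per session."""
    memory_time = weight_bytes / bandwidth_bytes_s       # shared by the batch
    compute_time = batch * flops_per_token / peak_flops  # grows with batch
    return max(memory_time, compute_time)


WEIGHT_BYTES = 4e9         # hypothetical 4-bit 7B model
FLOPS_PER_TOKEN = 2 * 7e9  # ~2 FLOPs per parameter per token
BANDWIDTH = 936e9          # bytes/second, RTX 3090 figure from the thread
PEAK_FLOPS = 35e12         # placeholder peak compute


def per_session_tok_s(batch: int) -> float:
    return 1.0 / step_time_s(batch, WEIGHT_BYTES, FLOPS_PER_TOKEN,
                             BANDWIDTH, PEAK_FLOPS)


def total_tok_s(batch: int) -> float:
    return batch * per_session_tok_s(batch)


for batch in (1, 8, 32, 128):
    print(batch, round(per_session_tok_s(batch)), round(total_tok_s(batch)))
```

Under these assumptions, per-session speed stays flat until the batch becomes compute-bound, after which each session slows down while combined throughput levels off. Where that crossover sits on real hardware depends on details this sketch leaves out.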
We don’t see these speeds because datacenter GPUs are running much larger models, as I have said repeatedly. Even GPT-3.5 Turbo is huge by comparison, since it is believed to be 20B parameters. It would run at about a third of the speed of Mistral. But, GPT-4 is where things get really useful, and no one knows (publicly) just how huge that is. It is definitely a lot slower than GPT-3.5, which in turn is a lot slower than Mistral.
would love to hear what you think: https://testflight.apple.com/join/ERFxInZg