by coder543 899 days ago
EDIT: Attempting to converse with any Q4_K_M 7B parameter model on a 15 Pro Max... the phone just melts down. It feels like it is producing about one token per minute. MLC-Chat can handle 7B parameter models just fine even on a 14 Pro Max, which has less RAM, so I think there is an issue here.

EDIT 2: Even using StableLM, I am experiencing a total crash of the app fairly consistently if I chat in one conversation, then start a new conversation and try to chat in that. On a related note, since chat history is saved... I don't think it's necessary to have a confirmation prompt if the user clicks the "new chat" shortcut in the top right of a chat.

-----

That does seem much nicer than MLC Chat. I really like the selection of models and saving of conversations.

It looks like you’re still using the old version of TinyLlama. The 1.0 release is out now: https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGU...

Microsoft recently re-licensed Phi-2 to be MIT instead of non-commercial, so I would love to see that in the list of models. Similarly, there is a Dolphin-Phi fine tune.

The topic of discussion here is Mistral-7B v0.2, which is also missing from the model list, unfortunately. There are a few Mistral fine tunes in the list, but obviously not the same thing.

I also wish I could enable performance metrics to see how many tokens/sec the model was running at after each message, and to see how much RAM is being used.
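For illustration, a tokens/sec readout like the one requested above could be computed by timing the token stream. This is only a minimal sketch: `measure_throughput` and its `clock` parameter are hypothetical names, and it assumes the model exposes its output as an iterable of tokens, which may not match how the app is built.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_throughput(
    token_stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[int, float]:
    """Consume a token stream and return (token_count, tokens_per_second).

    `clock` is injectable so the function can be checked without real timing.
    """
    start = clock()
    count = 0
    for _ in token_stream:
        count += 1
    elapsed = clock() - start
    # Guard against a zero-length interval (e.g. an empty stream).
    rate = count / elapsed if elapsed > 0 else 0.0
    return count, rate


if __name__ == "__main__":
    # Stand-in for a real model: a generator that yields a few tokens slowly.
    def fake_model():
        for tok in ["Hello", ",", " world", "!"]:
            time.sleep(0.05)
            yield tok

    n, tps = measure_throughput(fake_model())
    print(f"{n} tokens at {tps:.1f} tokens/sec")
```

The measurement includes time to first token, so a per-message figure would look lower than steady-state generation speed on long prompts.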

On the whole, this app seems really nice!


Wow, thanks so much for taking the time to test it out and share such great feedback!

Thrilled about all those developments! More model options as well as link-based GGUF downloads on the way.

On the 7b models: I’m very sorry for the poor experience. I wouldn’t recommend running 7b models at anything above Q2_K at the moment, unless you’re on a 16GB iPad (or an Apple Silicon Mac!). This needs to be much clearer; as you observed, the consequences can be severe. The larger models, and even 3b Q6_K, can be crash prone due to memory pressure. Will work on improving handling of low-level out-of-memory errors very soon.

Will also investigate the StableLM crashes, I’m sorry about that! Hopefully TestFlight recorded a trace. Just speculating, but it may be a similar issue to the larger models: the higher-fidelity quant (Q6_K) combined with the context length eventually running out of RAM. Could you give the Q4_K_M a shot? I heard something similar from a friend yesterday, so I’m curious whether you have a better time with that. Perhaps that’s a more sensible default.

Re: the overly-protective new chat alert, I agree, thanks for the suggestion. I’ll incorporate that into the next build. Can I credit you? Let me know how you’d like for me to refer to you, and I’d be happy to.

Finally, please feel free to email me any further feedback, and thanks again for your time and consideration!

britt [at] bl3 [dot] dev

I just checked and MLC Chat is running the 3-bit quantized version of Mistral-7B. It works fine on the 14 Pro Max (6GB RAM) without crashing, and is able to stay resident in memory on the 15 Pro Max (8GB RAM) when switching with another not-too-heavy app. 2-bit quantization just feels like a step too far, but I’ll give it a try.
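As a rough back-of-the-envelope check on the quantization levels discussed here, the weights of a model take about (parameters × bits per weight ÷ 8) bytes. The sketch below uses nominal bit widths (2, 3, 4, 6) as an assumption. Real quantized files are somewhat larger because they store scales and keep some tensors at higher precision, and the KV cache and runtime overhead come on top. `weight_bytes` is a hypothetical helper name.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in bytes."""
    return n_params * bits_per_weight / 8


def to_gb(n_bytes: float) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_bytes / 1e9


if __name__ == "__main__":
    n_params = 7e9  # a "7B" model, treated as exactly 7 billion parameters
    for bits in (2, 3, 4, 6):
        size = to_gb(weight_bytes(n_params, bits))
        print(f"7B at nominal {bits}-bit: ~{size:.2f} GB of weights")
```

Under these assumptions a 3-bit 7B model needs noticeably less memory than a 4-bit one, which is consistent with it fitting on a 6GB phone while the 4-bit variant struggles, though how much RAM an app may actually use on a given device is not something this estimate captures.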

Regarding credit, I definitely don’t need any. Just happy to see someone working on a better LLM app!

FYI, just submitted a new update for review with a few small but hopefully noticeable changes, thanks in no small part to your feedback:

1. StableLM Zephyr 3b Q4_K_M is now the built-in model, replacing the Q6_K variant.

2. More aggressive RAM headroom calculation, with forced fallback to CPU rather than failing to load or crashing.

3. New status indicator for Metal when a model is loaded (filled bolt for enabled, slashed bolt for disabled).

4. Metal will now also be enabled for devices with 4GB RAM or less, but only when the selected model can comfortably fit in RAM. Previously, only devices with at least 6GB had Metal enabled.
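The headroom check and CPU fallback in items 2 and 4 could look something like the sketch below. This is not the app's actual logic, which isn't described in the thread: `choose_backend`, the `usable_fraction` budget, and the example sizes are all hypothetical, chosen only to show the shape of the decision.

```python
def choose_backend(
    model_bytes: float,
    device_ram_bytes: float,
    usable_fraction: float = 0.5,  # hypothetical share of RAM the app may use
) -> str:
    """Return "metal" if the model comfortably fits in the RAM budget, else "cpu".

    Falling back to CPU instead of refusing to load mirrors the behavior
    described in the list above; the threshold itself is an assumption.
    """
    budget = device_ram_bytes * usable_fraction
    return "metal" if model_bytes <= budget else "cpu"


if __name__ == "__main__":
    GB = 1e9
    # Illustrative sizes only, not measured file sizes.
    print(choose_backend(model_bytes=1.8 * GB, device_ram_bytes=4 * GB))  # small model, 4GB device
    print(choose_backend(model_bytes=4.5 * GB, device_ram_bytes=8 * GB))  # large model, 8GB device
```

Deciding on the model size relative to the device, rather than on a fixed minimum RAM, is what lets a small model use the GPU on a 4GB device while a large model on an 8GB device still falls back.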

Thank you so much again for your time!

The fallback does seem to work! Although the 4-bit 7B models only run at 1 token every several seconds.

I still wish Phi-2, Dolphin Phi-2, and TinyLlama-Chat-v1.0 were available, but I understand you have plans to make it easier to download any model in the future.

4-bit StableLM and 2-bit 7B models do seem to be working more consistently.

That’s great to hear. I’m sorry again about that poor experience, and please do reach out if you have any other feedback!

Britt