by brittlewis12 897 days ago
Wow, thanks so much for taking the time to test it out and share such great feedback!

Thrilled about all those developments! More model options as well as link-based GGUF downloads on the way.

On the 7b models: I’m very sorry for the poor experience. I wouldn’t recommend any 7b quant above Q2_K at the moment, unless you’re on a 16GB iPad (or an Apple Silicon Mac!). This needs to be much clearer; as you observed, the consequences can be severe. The larger models, and even 3b Q6_K, can be crash-prone due to memory pressure. I’ll work on improving the handling of low-level out-of-memory errors very soon.

Will also investigate the StableLM crashes, I’m sorry about that! Hopefully TestFlight recorded a trace. Just speculating, but it may be a similar issue to the larger models: the higher-fidelity quant (Q6_K) combined with a growing context eventually running out of RAM. Could you give the Q4_K_M a shot? I heard something similar from a friend yesterday, and I’m curious whether you have a better time with that. Perhaps that’s a more sensible default.

Re: the overly-protective new chat alert, I agree, thanks for the suggestion. I’ll incorporate that into the next build. Can I credit you? Let me know how you’d like for me to refer to you, and I’d be happy to.

Finally, please feel free to email me any further feedback, and thanks again for your time and consideration!

britt [at] bl3 [dot] dev

2 comments

I just checked and MLC Chat is running the 3-bit quantized version of Mistral-7B. It works fine on the 14 Pro Max (6GB RAM) without crashing, and is able to stay resident in memory on the 15 Pro Max (8GB RAM) when switching with another not-too-heavy app. 2-bit quantization just feels like a step too far, but I’ll give it a try.
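For a rough sense of why bit width matters so much on these phones, here is a back-of-the-envelope sketch of weight size alone. It assumes a nominal 7 billion parameters and exactly N bits per weight; real quant formats carry extra scale data and the KV cache adds more on top, so treat these as lower bounds rather than measured figures. The helper names are made up for illustration.

```python
GIB = 2**30

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed for the weights alone (ignores quant metadata and KV cache)."""
    return n_params * bits_per_weight / 8

def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Same estimate expressed in GiB."""
    return weight_bytes(n_params, bits_per_weight) / GIB

if __name__ == "__main__":
    n_params = 7e9  # nominal "7B"; an assumption, not an exact parameter count
    for bits in (2, 3, 4, 6):
        print(f"{bits}-bit: ~{weight_gib(n_params, bits):.2f} GiB of weights")
```

By this estimate, 3-bit weights for a nominal 7B model come to roughly 2.6 GB, which leaves some room on a 6GB device, while 4-bit is about 3.5 GB before any context is allocated.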

Regarding credit, I definitely don’t need any. Just happy to see someone working on a better LLM app!

FYI, just submitted a new update for review with a few small but hopefully noticeable changes, thanks in no small part to your feedback:

1. StableLM Zephyr 3b Q4_K_M is now the built-in model, replacing the Q6_K variant.

2. More aggressive RAM headroom calculation, with forced fallback to CPU rather than failing to load or crashing.

3. New status indicator for Metal when a model is loaded (filled bolt for enabled, slashed bolt for disabled).

4. Metal will now also be enabled for devices with 4GB RAM or less, but only when the selected model can comfortably fit in RAM. Previously, only devices with at least 6GB had Metal enabled.
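To make points 2 and 4 concrete, the decision could look something like the sketch below. This is not the app's actual code: the function name, the 50% "comfortable" fraction, and the string labels are all assumptions chosen for illustration. The idea is only that the GPU path is taken when the model fits within a conservative share of device RAM, and otherwise the load falls back to CPU instead of failing.

```python
GIB = 2**30

def choose_backend(model_bytes: int,
                   device_ram_bytes: int,
                   comfortable_fraction: float = 0.5) -> str:
    """Pick "metal" when the model fits comfortably in RAM, else fall back to "cpu".

    comfortable_fraction is a hypothetical headroom setting: the share of total
    device RAM the model is allowed to occupy before GPU offload is skipped.
    """
    budget = device_ram_bytes * comfortable_fraction
    return "metal" if model_bytes <= budget else "cpu"

if __name__ == "__main__":
    # A ~2 GiB model on a 4 GiB device fits the budget, so Metal is enabled.
    print(choose_backend(2 * GIB, 4 * GIB))
    # A ~4 GiB model on a 6 GiB device exceeds it, so the load falls back to CPU.
    print(choose_backend(4 * GIB, 6 * GIB))
```

Keying the check on model size relative to RAM, rather than on a fixed device tier, is what lets smaller devices use Metal with small models.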

Thank you so much again for your time!

The fallback does seem to work, although the 4-bit 7B models only run at about 1 token every several seconds.

I still wish Phi-2, Dolphin Phi-2, and TinyLlama-Chat-v1.0 were available, but I understand you have plans to make it easier to download any model in the future.

4-bit StableLM and 2-bit 7B models do seem to be working more consistently.

That’s great to hear. I’m sorry again about that poor experience, and please do reach out if you have any other feedback!

Britt