| EDIT: Attempting to converse with any Q4_K_M 7B parameter model on a 15 Pro Max... the phone just melts down. It feels like it is producing about one token per minute. MLC-Chat can handle 7B parameter models just fine even on a 14 Pro Max, which has less RAM, so I think there is an issue here. EDIT 2: Even using StableLM, I am experiencing a total crash of the app fairly consistently if I chat in one conversation, then start a new conversation and try to chat in that. On a related note, since chat history is saved... I don't think it's necessary to have a confirmation prompt if the user clicks the "new chat" shortcut in the top right of a chat. ----- That does seem much nicer than MLC Chat. I really like the selection of models and saving of conversations. It looks like you’re still using the old version of TinyLlama. The 1.0 release is out now: https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGU... Microsoft recently re-licensed Phi-2 to be MIT instead of non-commercial, so I would love to see that in the list of models. Similarly, there is a Dolphin-Phi fine tune. The topic of discussion here is Mistral-7B v0.2, which is also missing from the model list, unfortunately. There are a few Mistral fine tunes in the list, but obviously not the same thing. I also wish I could enable performance metrics to see how many tokens/sec the model was running at after each message, and to see how much RAM is being used. On the whole, this app seems really nice! |
Thrilled about all those developments! More model options as well as link-based GGUF downloads on the way.
On the 7B models: I’m very sorry for the poor experience. I wouldn’t recommend running 7B at anything above Q2_K at the moment, unless you’re on a 16GB iPad (or an Apple Silicon Mac!). This needs to be much clearer; as you observed, the consequences can be severe. The larger models, and even 3B at Q6_K, can be crash-prone due to memory pressure. I will work on improving the handling of low-level out-of-memory errors very soon.
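For a rough sense of why quantization level matters so much on a phone, here is a back-of-the-envelope sketch of weight size as parameters times bits per weight. The bits-per-weight figures are approximate assumptions for illustration (real GGUF files vary by architecture), and the function name is hypothetical. The estimate covers the weights only; the KV cache, the app itself, and per-app memory limits all eat into what is actually available.

```python
# Approximate bits per weight for a few llama.cpp quantization types.
# These numbers are rough assumptions for illustration, not exact values.
APPROX_BITS_PER_WEIGHT = {
    "Q2_K": 3.35,
    "Q4_K_M": 4.85,
    "Q6_K": 6.56,
}


def estimate_weights_gib(n_params: float, quant: str) -> float:
    """Estimate the size of the quantized weights alone, in GiB."""
    total_bits = n_params * APPROX_BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 2**30


if __name__ == "__main__":
    # Nominal parameter counts; real models differ slightly.
    for name, n_params in [("3B", 3e9), ("7B", 7e9)]:
        for quant in APPROX_BITS_PER_WEIGHT:
            size = estimate_weights_gib(n_params, quant)
            print(f"{name} {quant}: ~{size:.1f} GiB of weights")
```

Comparing these estimates against the memory an app can realistically claim on a given device is a quick way to sanity-check which model and quant combinations are worth offering there.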
I’ll also investigate the StableLM crashes; I’m sorry about that! Hopefully TestFlight recorded a trace. Just speculating, but it may be a similar issue to the larger models: the higher-fidelity quant (Q6_K) combined with the context length eventually running out of RAM. Could you give the Q4_K_M a shot? I heard something similar from a friend yesterday, and I’m curious whether you have a better time with that. Perhaps that’s a more sensible default.
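To illustrate how context length can add to memory pressure on top of the weights, here is a minimal sketch of the usual KV cache size formula for a transformer. The model shape below (32 layers, 32 KV heads, head dimension 80, 4096-token context, 16-bit cache entries) is a hypothetical 3B-class configuration chosen for illustration, not a confirmed description of any model in the app, and the function name is made up.

```python
def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold keys and values for n_ctx tokens.

    Each layer stores two tensors (keys and values), each of shape
    n_ctx x (n_kv_heads * head_dim).
    """
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical 3B-class shape; these values are assumptions.
    size = kv_cache_bytes(n_layers=32, n_ctx=4096, n_kv_heads=32, head_dim=80)
    print(f"KV cache at full context: {size / 2**30:.2f} GiB")
```

Under these assumed numbers the cache is a meaningful fraction of the weight size, so a long conversation (or a second one started while the first is still resident) could plausibly tip a device over its memory limit.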
Re: the overly protective new chat alert, I agree, and thanks for the suggestion. I’ll incorporate that into the next build. Can I credit you? Let me know how you’d like me to refer to you, and I’d be happy to.
Finally, please feel free to email me any further feedback, and thanks again for your time and consideration!
britt [at] bl3 [dot] dev