by dystnitem4r3 592 days ago

As someone who has been running llama.cpp for 2-3 years now (I started with RWKV v3 on Python, one of the most accessible models back then thanks to both CPU and GPU support and the ability to run on older small GPUs, even Kepler-era 2GB cards!), I feel the need to point out that needing only the llama.cpp binaries, at only 5MB, is ONLY true for CPU inference with pre-converted/quantized models. If you are getting a raw trained model from Meta, RWKV, THUDM, ByteDance, Microsoft, Alibaba, or any of the other big companies releasing open-weight (but generally not open-source) models to the public, you WILL need Python, torch, and dozens to hundreds of prerequisite Python modules in order to run the convert.py script and produce an output model.

Should you wish to convert a model yourself and have enough disk space, convert to BF16 for the majority of models (exceptions apply for models natively trained in FP32, FP16, or 1/1.58-bit formats), then run llama-quantize on that BF16 file to create any quantized versions. This minimizes conversion losses and lets you pick whichever accuracy vs. performance vs. space tradeoff makes the most sense for you.
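
A rough sketch of that two-step workflow follows. It assumes a recent llama.cpp checkout where the converter is named convert_hf_to_gguf.py and accepts --outfile and --outtype bf16 (older trees ship convert.py with different options, so check --help first), and that llama-quantize takes an input file, an output file, and a type name such as Q4_K_M. The helper name and all paths are made up for illustration; the script only builds the commands and prints them, it does not run anything.

```python
import shlex
from pathlib import PurePosixPath


def conversion_commands(model_dir, out_dir, quant_types=("Q8_0", "Q4_K_M")):
    """Build (not run) one BF16 conversion command, then one quantize command per type.

    Assumption: script name and flags match a recent llama.cpp checkout.
    """
    model_dir = PurePosixPath(model_dir)
    out_dir = PurePosixPath(out_dir)
    bf16 = out_dir / f"{model_dir.name}-BF16.gguf"

    # Step 1: convert the raw weights once, to a BF16 GGUF file.
    cmds = [[
        "python", "convert_hf_to_gguf.py", str(model_dir),
        "--outfile", str(bf16), "--outtype", "bf16",
    ]]

    # Step 2: derive every quantized file from that same BF16 file.
    for q in quant_types:
        out = out_dir / f"{model_dir.name}-{q}.gguf"
        cmds.append(["llama-quantize", str(bf16), str(out), q])
    return cmds


if __name__ == "__main__":
    # Hypothetical paths; substitute your own model directory.
    for cmd in conversion_commands("models/my-model", "gguf"):
        print(shlex.join(cmd))
```

Quantizing every variant from the one BF16 file, rather than re-quantizing an already quantized file, is the point of the exercise: each output only loses precision once.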

As far as models go, Mistral-2-Large, GLM-4 variants, and Mistral-Nemo-8B are my current non-multimodal favorites. llama.cpp doesn't currently support multimodal models, due to issues embedding the image tokens in the llama-server implementation, unless you use one of the various forks that use it as the inference backend. Most recently, the three models listed have given me the most personality when asked to play Colossus (M2L), the best translation between multiple languages while keeping the translations consistent with each other (GLM-4), and the most obscure code knowledge and annotation capabilities (Mistral-Nemo-8B, with CodeGeeX-4-9B, a GLM-4 finetune, as a close second). The last two models were both able to answer questions on 16-bit DOS C programming and near and far pointers, and even give assembly examples, although you have to specify very carefully that they should only emit 8086 or pre-80386 assembly mnemonics, or they will use the 32-bit e?x variants of the 16-bit ?x registers (eax instead of ax, and so on).
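
If you want to check a model's assembly output for that e?x slip mechanically, something like the following works as a first pass. It is only a sketch: the function name is made up, it assumes ';' starts a comment, and it only looks for the eight 32-bit general-purpose register names, so other 386-only features (fs/gs, newer instructions) will get past it.

```python
import re

# 32-bit general-purpose, index, and pointer registers introduced with the 80386.
REGS_386 = re.compile(r"\b(e(?:ax|bx|cx|dx|si|di|bp|sp))\b", re.IGNORECASE)


def find_386_registers(asm_text):
    """Return (line_number, register) pairs for 32-bit registers outside comments."""
    hits = []
    for n, line in enumerate(asm_text.splitlines(), start=1):
        code = line.split(";", 1)[0]  # ignore everything after a ';' comment marker
        for m in REGS_386.finditer(code):
            hits.append((n, m.group(1).lower()))
    return hits


sample = """\
mov ax, 4C00h   ; fine on an 8086
mov eax, 1      ; needs a 386
int 21h
"""
print(find_386_registers(sample))  # [(2, 'eax')]
```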

May this comment prove illuminating for one searching for light.