by brabel
306 days ago
For comparable quality you need a relatively big model, and that only works if you have around 64GB of RAM or more. The latest OpenAI local models (https://openai.com/index/introducing-gpt-oss/), for example, are really good, but you probably want the 120b to get results at least close to their best cloud models, and I think that requires 80GB+. If you don't have that much, you can try something like the DeepSeek models, which are known for being very efficient and runnable on "normal" computers, if you don't mind the politics of using them (and there are many similar models now!), but I haven't tried enough others to comment. On my MacBook M1 Pro I can run the gpt-oss-20b model without issues, and quite fast.
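For a rough sense of where numbers like "80GB+" come from, here is a back-of-the-envelope sketch. It assumes memory is dominated by the weights, that the weights are stored at about 4.25 bits each (a 4-bit block format with a shared scale per block, which is my assumption about how these models ship), and that runtime overhead such as the KV cache and buffers adds about 20%. The function name, the overhead factor, and the round parameter counts are all illustrative, not measured values.

```python
def estimate_memory_gb(num_params: float, bits_per_weight: float,
                       overhead: float = 1.2) -> float:
    """Rough memory estimate in decimal GB for holding a model in RAM/VRAM.

    num_params      -- total parameter count (e.g. 120e9 for a "120b" model)
    bits_per_weight -- average storage cost per weight after quantization
    overhead        -- assumed multiplier for KV cache, activations, buffers
    """
    weight_bytes = num_params * bits_per_weight / 8  # bits -> bytes
    return weight_bytes * overhead / 1e9             # bytes -> GB


if __name__ == "__main__":
    # Nominal sizes taken from the model names, not exact parameter counts.
    for name, params in [("gpt-oss-120b", 120e9), ("gpt-oss-20b", 20e9)]:
        print(f"{name}: ~{estimate_memory_gb(params, 4.25):.1f} GB at 4.25 bits/weight")
        print(f"{name}: ~{estimate_memory_gb(params, 16):.1f} GB at 16 bits/weight")
```

Under those assumptions the 120b lands at about 76.5 GB and the 20b at about 12.75 GB, which lines up with the 120b wanting an 80GB-class machine and the 20b fitting on a laptop. Treat it as an order-of-magnitude check only; real usage depends on context length and the runtime you use.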
|
That said, Qwen3 and Qwen3 Coder are both pretty nice. Also ERNIE 4.5, if the benchmarks are to be trusted, but I mostly run Ollama instead of vLLM now so I can’t test it out atm (apparently llama.cpp added support for the ERNIE models recently, though).
The models by Mistral might also be worth a look, and personally I thought the EuroLLM project was nice too, but MoE models feel way more palatable on limited hardware.
None of these seem able to compete directly with Sonnet 4 or Gemini 2.5 Pro; you'd need way better hardware to come close.