by granitepail
317 days ago
While the benchmarks all say open source models like Kimi and Qwen outpace proprietary models like GPT 4.1, GPT 4o, or even o3, my boots-on-the-ground experience (and that of just about everyone I know) suggests they're not even close. This is for tool-calling agentic tasks, like coding, but also in other contexts (research, glue between services, etc.). I feel like it's worth putting that out there: it's pretty clear there's a lot of benchmark hacking happening. I'm not really convinced it's purposeful or deceitful, but it's definitely happening. Qwen3 Coder, for example, is basically incompetent for any real coding tasks and frequently gets caught in death spirals of bad tool calls. I try all the OSS models regularly, because I'm really excited for them to get better. Right now Kimi K2 is the most usable one, and I'd rate it at a few ticks worse than GPT 4.1.
I have an M4 Studio with a lot of unified memory and I’m still nowhere near running a 120B model. I’m at like 30B.
Apple or Nvidia is going to have to sell 1.5 TB RAM machines before benchmark performance is going to be comparable.
Plus, when you use Claude or OpenAI these days, it’s performing Google searches etc. that my local model isn’t doing.