| Qwen3 is substantially better in my local testing: it adheres to the prompt better (pretty much exactly for the 32B-parameter variant, very impressive) and sounds more organic. On SimpleBench, gpt-oss (120B) flopped hard, so it doesn't appear particularly good at logical puzzles either. So presumably this comes down to one or more of: training technique or data; model dimension; or a lower number of large experts vs. a higher number of small experts. |