|
|
|
|
|
by danenania
332 days ago
|
|
Check out the DeepSeek paper. Research/benchmarks aside, try giving a somewhat hard programming task to Opus 4 with reasoning off vs. on. Similarly, try the same with o3 vs. o3-pro (o3-pro reasons for much longer). I'm not going to dig through my history for specific examples, but I do these kinds of comparisons occasionally when coding, and it's not unusual to have e.g. a bug that o3 can't figure out, but o3-pro can. I think this is widely accepted by engineers using LLMs to help them code; it's not controversial. |
|
I wouldn't trust comparing two different models, even from the same provider and family, since there could be many reasons for the performance to be different. Their system prompts, training data, context size, or runtime parameters could be different. Even the same model with the same prompt could have varying performance. So it's difficult to get a clear indication that the reasoning steps are the only changing variable.
But toggling it on the same model would be a more reliable way to test this, so I'll try that, thanks.