by Fergusonb
513 days ago
These benchmarks have even the small models absolutely demolishing Sonnet-3.5, which doesn't reflect my subjective experience. These models still seem 'dumb' to me and often don't understand what I'm asking, whereas Claude's intuition is much stronger. r1 14b even feels weaker than qwen 2.5 14b to me. My primary use case is web technology / coding. Maybe I'm prompting it incorrectly?
o1 or even o3 might be able to crack academic-level math problems, but I still wouldn't trust either to correctly fill out a McDonald's application using a PDF of my resume and a calendar of my availability.