I gave 4.6, 4.7 and GPT 5.5 the same prompt and task: reverse engineer a collection of sample vector files from an obscure Amiga CAD program, write a detailed txt specification and a Python converter that converts the files to SVG, and produce a report so I can visually verify the results.
4.6 did very well. 90% perfect on first try, got to 100% with just a few followups.
4.7 failed horribly. First produced garbage output and claimed it was done, admitted it did that when called out, proceeded to work at it a lot longer and then IT GAVE UP.
GPT 5.5 codex was shockingly good. Achieved 90% perfect on first try in about a fourth of the time. Got to 100% faster and with fewer follow-ups.
Just tried with DeepSeek V4 Pro with OpenCode. It didn't do great. First attempt produced somewhat correct drawings for some of the original samples, but most were just a spaghetti mess of lines. Some prodding got it to do a little better, but still not right. A third prod and it went down a wild rabbit hole and was much worse. I gave up.
I also tried GLM 5.1. Its first attempt was such a disaster I didn't bother working with it any further. It also took by far the longest and wasted a bunch of time/tokens trying to find other converters online (and failing) instead of just reverse engineering the format from the sample files given.
Interesting. I would love to see your test, but for code. If I were to forgo my Claude subscription for a Chinese cloud-hosted model or local models running on my own hardware, I'd use them mostly for code.
The thing is, I've tried to come up with a good test of my own and spent countless hours just tweaking it instead of saying "this is good enough" and benchmarking.
I’m impressed.