| HN Mirror


	by gigatexal 61 days ago
	What's the real-world comparison to Opus 4.7, fellow coders?

Sembiance 61 days ago

I gave 4.6, 4.7 and GPT 5.5 the same prompt and task: reverse engineer a collection of sample vector files from an obscure Amiga CAD program, write a detailed txt specification and a Python converter that outputs SVG, and produce a report so I can visually verify the results.

4.6 did very well. 90% perfect on first try, got to 100% with just a few follow-ups. 4.7 failed horribly. It first produced garbage output and claimed it was done, admitted as much when called out, worked at it a lot longer, and then IT GAVE UP. GPT 5.5 codex was shockingly good. It achieved 90% perfect on first try in about a fourth of the time, and got to 100% faster and with fewer follow-ups.

I’m impressed.

gigatexal 61 days ago

Interesting that 4.7 failed like that. Seems 5.5 is impressive but is oh so expensive.

Would be interesting if you ran the same test with DeepSeek V4 and some of the other Chinese models.

Sembiance 61 days ago

Just tried with DeepSeek V4 Pro with OpenCode. It didn't do great. First attempt produced somewhat correct drawings for some of the original samples, but most were just a spaghetti mess of lines. Some prodding got it to do a little better, but still not right. A third prod and it went down a wild rabbit hole and was much worse. I gave up.

I also tried GLM 5.1. Its first attempt was such a disaster that I didn't bother working with it any further. It also took by far the longest and wasted a bunch of time/tokens trying to find other converters online (and failing) instead of just reverse engineering the format from the sample files given.

gigatexal 60 days ago

Interesting. I would love to see your test, but for code. If I were to forgo my Claude subscription for a Chinese cloud-hosted model or local models running on my own hardware, I'd use them mostly for code.

The thing is, I've tried to come up with a good test of my own and spent countless hours just tweaking it instead of saying this is good enough and benchmarking.
