| HN Mirror


	by johnfn 232 days ago
	Literally yesterday we had a post about GPT-5.2, which jumped 30% on ARC-AGI 2, scored 100% on AIME without tools, and posted a bunch of other impressive stats. A layman's (my) reading of those numbers is that the models continue to improve as fast as they always have. Then today we have people saying every iteration is further from AGI. It really perplexes me how split-brain HN is on this topic.

5 comments

qouteall 232 days ago

Goodhart's law: When a measure becomes a target, it ceases to be a good measure.

AI companies have a strong incentive to make scores go up. They may employ humans to write similar-to-benchmark training data to game the benchmark (while not directly training on the test set).

Throwing your hard problems from work at an LLM is a better metric than benchmarks.

idopmstuff 232 days ago

I own a business and am constantly working on using AI in every part of it, both for actual time savings and also as my very practical eval. On the "can this successfully be used to do work that I do or pay someone else to do more quickly/cheaply/etc." eval, I can confirm that models are progressing nicely!

unaesoj 232 days ago

I work in construction. GPT-5.2 is the first model that has been able to make a quantity takeoff for concrete and rebar from a set of drawings. I've been testing this since o1.

vlovich123 232 days ago

One classic problem in all ML is ensuring the benchmark is representative and that the algorithm isn’t overfitting the benchmark.

This remains an open problem for LLMs: we don’t have true AGI benchmarks, and the LLMs are frequently learning the benchmark problems without necessarily getting that much better in the real world. Gemini 3 has been hailed precisely because it’s delivered huge gains across the board that aren’t overfitting to benchmarks.

ipaddr 232 days ago

This could be a solved problem. Come up with problems that are not online and compare. Later, use LLMs to sort through your problems and classify them from easy to difficult.

vlovich123 232 days ago

That's hard to do for an industry benchmark, since running the test requires sending the questions to the LLM, which basically puts them into a public training set.

This has been tried multiple times by multiple people, and over time it ends up not doing so great at retaining immunity to “cheating”.

kalkin 232 days ago

How do you imagine existing benchmarks were created?

FuckButtons 232 days ago

HN is not an entity with a single perspective, and there are plenty of people on here who have a financial stake in you believing their perspective on the matter.

rester324 232 days ago

My honest question: isn't simonw one of those people? It feels that way to me.

simonw 232 days ago

You mean having a financial stake?

Not really. I have a set of disclosures on my blog here: https://simonwillison.net/about/#disclosures

I'm beginning to pick up a few more consulting opportunities based on my writing and my revenue from GitHub sponsors is healthy, but I'm not particularly financially invested in the success of AI as a product category.

rester324 231 days ago

Thanks for the link. I see that you get credits and access to embargoed releases. I understand that's not a financial stake, but it seems like enough of an incentive to say positive things about those services, doesn't it? Not that it matters to me, and I might be wrong, but to an outsider it might seem so.

simonw 231 days ago

Yeah it is, that's why I disclose this stuff.

The counter-incentive here is that my reputation and credibility are more valuable to me than early access to models.

This very post is an example of me taking a risk of annoying a company that I cover. I'm exposing the existence of the ChatGPT skills mechanism here (which I found out about from a tip on Twitter - it's not something I got given early access to via an NDA).

It's very possible OpenAI didn't want that story out there yet and aren't happy that it's sat at the top of Hacker News right now.

yojat661 232 days ago

Of course he is.

noitpmeder 232 days ago

Just because they're better at writing CS algorithms doesn't mean they're taking steps closer to anything resembling AGI.

p1esk 232 days ago

Unless AGI is just a bunch of CS algorithms.

airstrike 232 days ago