|
|
|
|
|
by bigstrat2003
198 days ago
|
|
> What I never understand is the population of coders that don’t see any value in coding agents or are aggressively against them, or people that deride LLMs as failing to be able to do X (or hallucinate etc) and are therefore useless and every thing is AI Slop, without recognizing that what we can do today is almost unrecognizeable from the world of 3 years ago. I don't recognize that because it isn't true. I try the LLMs every now and then, and they still make the same stupid hallucinations that ChatGPT did on day 1. AI hype proponents love to make claims that the tech has improved a ton, but based on my experience trying to use it those claims are completely baseless. |
|
One of the tests I sometimes do of LLMs is a geometry puzzle:
They all used to get this wrong all the time. Now the best ones sometimes don't. (That said, only one to succed just as I write this comment was DeepSeek; the first I saw succeed was one of ChatGPT's models but that's now back to the usual error they all used to make).Anecdotes are of course a bad way to study this kind of thing.
Unfortunately, so are the benchmarks, because the models have quickly saturated most of them, including traditional IQ tests (on the plus side, this has demonstrated that IQ tests are definitely a learnable skill, as LLMs loose 40-50 IQ points when going from public to private IQ tests) and stuff like the maths olympiad.
Right now, AFAICT the only open benchmarks are the METR time horizon metric, the ARC-AGI family of tests, and the "make me an SVG of ${…}" stuff inspired by Simon Willison's pelican on a bike.