by energy123 307 days ago
How can you say progress has stalled two weeks after LLMs won gold medals at IOI and IMO?

How can you say progress has stalled without having visibility on the compute costs of gpt-5 relative to o3?

How can you say progress has stalled by referring to changes in benchmarks at the frontier over just 3.5 months?


You can't say that with any certainty, but I personally share the impression that growth has not kept up with the hype of 2023. Take the following, for example: an article from April 2023 that strongly implies the next version of GPT would be so much more powerful than the current one that it would be dangerous to work on or even release.

Altman specifically used the version number "GPT5" back then. GPT5 is quite good, but is it the kind of technology that requires a worldwide moratorium on its development, lest it make humanity redundant?

"""

(Friedman) asked Altman for his thoughts on the recently released and widely circulated open letter demanding an AI pause. In response, the OpenAI founder shared some of his critiques. “An earlier version of the letter claimed OpenAI is training GPT-5 right now. We are not, and won’t for some time,” Altman noted. “So in that sense, [the letter] was sort of silly.”

But, GPT-5 or not, Altman’s statement isn’t likely to be particularly reassuring to AI’s critiques, as first pointed out in a report from the Verge. The tech founder followed up his “no GPT-5″ announcement by immediately clarifying that upgrades and updates are in the works for GPT-4. There are ways to increase a technologies’ capacity beyond releasing an official, higher-number version of it.

"""

(from: https://gizmodo.com/sam-altman-open-ai-chatbot-gpt4-gpt5-185...)

All that feels like specialized stunts like IBM’s Watson beating Ken Jennings at Jeopardy.

The rate of improvement has slowed significantly. And chasing benchmarks is making everything worse IMO. Opus 4.1 is worse than Sonnet 3.7 to me :/.

I think the future will be:

1. Ads and quantization/routing to chase profits

2. Local models start taking over. New companies will slide in without the huge losses and provide what Claude/OpenAI do today at reasonable margins

3. Apple/Google eat up lots of the market by shipping good-enough models with iOS/Android

My personal test question keeps bombing, and I think it's something they should be capable of doing?

Are those math contests? Are their questions and answers in the training set?

Let's say that these things really won a math Olympiad by thinking. OK, then I would like them to write parsers based on a well-defined expression or language spec. Nothing as bad as nearly unparseable C++ or JavaScript.

The AIs refuse, despite the prompt, to write a complete parser. They hallucinate tests, do things like just call the already working compiler on the CLI, and force repetitive reprompts that still won't complete the task.

To me, this is a good example of a task I would give AI as a service to see if it will reliably do something that's well specified, moderately annoying, and is most definitely in the training set if they are pulling data from "the internet".
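For a sense of scale, here is a minimal sketch of the kind of parser I mean. The three-rule arithmetic grammar in the comments is an assumption made up for illustration, not my actual test question, and the names (`tokenize`, `Parser`, `parse`) are hypothetical.

```python
import re

# Assumed toy grammar (illustrative only):
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '-' factor | '(' expr ')'
TOKEN = re.compile(r"\s*(?:(\d+(?:\.\d+)?)|(.))")


def tokenize(text):
    """Split the input into ("num", float) and ("op", char) tokens."""
    tokens = []
    for number, op in TOKEN.findall(text):
        if number:
            tokens.append(("num", float(number)))
        elif op.strip():  # skip stray whitespace captured by the fallback group
            if op not in "+-*/()":
                raise SyntaxError(f"unexpected character {op!r}")
            tokens.append(("op", op))
    return tokens


class Parser:
    """Recursive-descent parser: one method per grammar rule."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return ("eof", None)

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        node = self.term()
        while self.peek() in (("op", "+"), ("op", "-")):
            _, op = self.take()
            node = (op, node, self.term())  # left-associative
        return node

    def term(self):
        node = self.factor()
        while self.peek() in (("op", "*"), ("op", "/")):
            _, op = self.take()
            node = (op, node, self.factor())
        return node

    def factor(self):
        kind, value = self.take()
        if kind == "num":
            return value
        if (kind, value) == ("op", "-"):
            return ("neg", self.factor())
        if (kind, value) == ("op", "("):
            node = self.expr()
            if self.take() != ("op", ")"):
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {value!r}")


def parse(text):
    """Parse a whole string into a nested-tuple syntax tree."""
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek()[0] != "eof":
        raise SyntaxError("trailing input")
    return tree
```

Calling `parse("1 + 2 * 3")` builds a tree in which the multiplication binds tighter than the addition. A real language spec is much bigger than this, but the shape is the same: one function per rule, and no shelling out to an existing compiler.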

> My personal test question keeps bombing, and I think it's something they should be capable of doing?

The problem is that "they" isn't a monolith. How much compute went into your tests? GPT-5 Thinking in ChatGPT Plus uses less compute than GPT-5 Thinking in ChatGPT Pro, which uses less compute than the "high" reasoning effort when "gpt-5" is called via the API, which uses less compute than GPT-5 Pro in ChatGPT Pro, which uses less compute than custom scaffolds, which use less compute than what went into the IMO/IOI solutions. This is not just my idle speculation; it's publicly available information.