
by ericskiff 584 days ago

What we can reasonably assume from statements made by insiders:

They want a 10x improvement from scaling and a 10x improvement from data and algorithmic changes

The sources of public data are essentially tapped

Algorithmic changes will be an unknown to us until they release, but from published research this remains a steady source of improvement

Scaling seems to stall if data is limited

So with all of that taken together, the logical step is to figure out how to turn compute into better data to train on. Enter strawberry / o1, and now o3

They can throw money, time, and compute at thinking about and then generating better training data. If the belief is that N billion new tokens of high quality training data will unlock the leap in capabilities they’re looking for, then it makes sense to delay the training until that dataset is ready

With o3 now public knowledge, imagine how long it’s been churning out new thinking at expert level across every field. OpenAI’s next moat may be the best synthetic training set ever.

At this point I would guess we get 4.5 with a subset of this - some scale improvement, the algorithmic pickups since 4 was trained, and a cleaned and improved core data set but without risking leakage of the superior dataset

When 5 launches, we get to see what a fully scaled version looks like with training data that outstrips average humans in almost every problem space

Then the next o-model gets to start with that as a base and reason? It's likely to be remarkable

7 comments

sdwr 584 days ago

Great improvements and all, but they are still no closer (as of 4o regular) to having a system that can be responsible for work. In math problems, it forgets which variable represents what, in coding questions it invents library fns.

I was watching a YouTube interview with a "trading floor insider". They said they were really being paid for holding risk. The bank has a position in a market, and it's their ass on the line if it tanks.

ChatGPT (as far as I can tell) is no closer to being accountable or responsible for anything it produces. If they don't solve that (and the problem is probably inherent to the architecture), they are, in some sense, polishing a turd.

nightowl_games 583 days ago

> They said they were really being paid for holding risk.

I think that's a really interesting insight that has application to using 'AI' in jobs across the board.

zifpanachr23 583 days ago

This is underdiscussed. I don't think people understand just how worthless AI is in a ton of fields until it is able to be held liable and be sent to prison.

There are a lot of moral conundrums that are just not going to work out with this. Seems like an attempt to just offload liability, and it seems like pretty much everybody has caught onto that as being its main selling point and probably the main thing that will keep it from ever being accepted for anything important.

tucnak 584 days ago

> ChatGPT (as far as I can tell) is no closer to being accountable or responsible for anything it produces.

What does it even mean? How do you imagine that? You want OpenAI to take on liability for the kicks of it?

numpad0 584 days ago

If an LLM can't be left to do mowing by itself, but a human has to closely monitor and intervene at every one of its steps, then it's just a super fast predictive keyboard, no?

dyauspitr 583 days ago

But what if the human only has to intervene once every 100 hours, that’s a huge productivity boost.

cjblomqvist 583 days ago

The point is you don't know which of those 100 hours that is, so you still need to monitor the full 100 hour time span.

Can still be a boost. But definitely not the same magnitude.

kjkjadksj 583 days ago

And one might also wonder still if we need a general language model to mow the grass or just a simpler solution to the problem of driving a mower over a fixed property line automatically. Something you could probably solve with WWII-era technology, honestly.

dmkolobov 584 days ago

Obviously not. I want legislation which imposes liability on OpenAI and similar companies if they actively market their products for use in safety-critical fields and their product doesn’t perform as advertised.

If a system is providing incorrect medical diagnoses, or denying services to protected classes due to biases in the training data, someone should be held accountable.

sdwr 584 days ago

Personal responsibility, not legal liability. In the way a child can be responsible for a pet.

ChatGPT was trained on benchmarks and user opinions - "throwing **** at the wall to see what sticks".

Responsibility means penalties for making mistakes, and, more importantly, having an awareness of those penalties (that informs its decision-making).

SpicyLemonZest 584 days ago

They would want to, if they thought they could, because doing so would unblock a ton of valuable use cases. A tax preparation or financial advisor AI would do huge numbers for any company able to promise that its advice can be trusted.

Stevvo 584 days ago

"With o3 now public knowledge, imagine how long it’s been churning out new thinking at expert level across every field."

I highly doubt that. o3 is many orders of magnitude more expensive than paying subject matter experts to create new data. It just doesn't make sense to pay six figures in compute to get o3 to make data a human could make for a few hundred dollars.

bookaway 584 days ago

Yes, I think they had to push this reveal forward because their investors were getting antsy with the lack of visible progress to justify continually rising valuations. There is no other reason a confident company making continuous rapid progress would feel the need to reveal a product that 99% of companies worldwide couldn't use at the time of the reveal.

That being said, if OpenAI is burning cash at lightspeed and doesn't have to publicly reveal the revenue they receive from certain government entities, it wouldn't come as a surprise if they let the government play with it early on in exchange for some much needed cash to set on fire.

EDIT: The fact that multiple sites seem to be publishing GPT-5 stories similar to this one leads one to conclude that the o3 benchmark story was meant to counter the negativity from this and other similar articles that are just coming out.

mrshadowgoose 584 days ago

Can SMEs deliver that data in a meaningful amount of time? Training data now is worth significantly more than data a year from now.

GolfPopper 583 days ago

>churning out new thinking at expert level across every field

I suspect this is really, "churning out text that impresses management".

tshadley 584 days ago

Seems to me o3 prices would be what the consumer pays, not what OpenAI pays. That would mean o3 could be more efficient in-house than paying subject-matter experts.

mrbungie 584 days ago

For every consumer there will be a period where they need both the SME and the o3 model for initial calibration and eventual handoff for actually getting those efficiencies in whichever processes they want to automate.

In other words if you are diligent enough, you should at least validate your o3 solution with an actual expert for some time. You wouldn't just blindly trust OpenAI with your business-critical processes, would you? I would expect at least 3 to 6 months for large corps and even more considering change management, re-upskilling, etc.

With all those considerations I really don't see the value prop at those prices and in those situations right now. Maybe if costs decrease ~1-3 orders of magnitude more for o3-low, depending on the processes being automated.

lalalali 584 days ago

What is open ai margin on that product?

dartos 584 days ago

That’s an interesting idea. What if OpenAI funded medical research initiatives in exchange for exclusive training rights on the research.

onlyrealcuzzo 584 days ago

It would be orders of magnitude cheaper to outsource to humans.

dartos 584 days ago

Not as sexy to investors though

aswegs8 584 days ago

Wait didn't they just recently request researchers to pair up with them in exchange for the data?

DougN7 584 days ago

Someone needs to dress up Mechanical Turk and repackage it as an AI company…..

jitl 584 days ago

That’s basically every AI company that existed before GPT3

rtsil 584 days ago

Unless the quality of the human data is extraordinary, it seems, according to TFA, that it's not that easy:

> The process is painfully slow. GPT-4 was trained on an estimated 13 trillion tokens. A thousand people writing 5,000 words a day would take months to produce a billion tokens.

And if the human-generated data were so good that the dataset could be smaller by three orders of magnitude, then I assume it would be at least as expensive as o3.
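A rough back-of-the-envelope check of the quoted estimate, as a sketch only. The writer count, words per day, and token totals come from the quote; the tokens-per-word ratio of about 1.3 is an assumption introduced here for illustration and is not a figure from the article.

```python
# Back-of-the-envelope check of the quoted numbers.
# TOKENS_PER_WORD is an assumed rough ratio for English text, not from the article.
TOKENS_PER_WORD = 1.3

def days_to_produce(target_tokens, writers=1_000, words_per_day=5_000,
                    tokens_per_word=TOKENS_PER_WORD):
    """Days needed for `writers` people to write `target_tokens` tokens."""
    tokens_per_day = writers * words_per_day * tokens_per_word
    return target_tokens / tokens_per_day

billion_days = days_to_produce(1e9)     # one billion tokens
gpt4_days = days_to_produce(13e12)      # the article's 13 trillion token estimate

print(f"1 billion tokens: about {billion_days:.0f} days")
print(f"13 trillion tokens: about {gpt4_days / 365:.0f} years")
```

Under that assumed ratio, a billion tokens takes roughly five months, which is consistent with the "months" in the quote, and matching the 13 trillion token estimate at the same pace would take thousands of years.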

az226 583 days ago

Only a matter of time. The costs are aggressively going down. And with specialized inference hardware it will go further down.

Cost of coordination is also large. Immediate answers are an advantage/selling point.

nialv7 584 days ago

> OpenAI’s next moat

I don't think oai has any moat at all. If you look around, QwQ from Alibaba is already pushing o1-preview performances. I think oai is only ahead by 3~6 months at most.

vasco 584 days ago

If their AGI dreams would come true it might be more than enough to have 3 months head start. They probably won't, but it's interesting to ponder what the next few hours, days, weeks would be for someone that would wield AGI.

Like let's say you have a few datacenters of compute at your disposal and the ability to instantiate millions of AGI agents - what do you have them do?

I wonder if the USA already has a secret program for this under national defense. But it is interesting that once you do control an actual AGI you'd want to speed-run a bunch of things. In opposition to that, how do you detect an adversary already has / is using it and what to do in that case.

kevingadd 584 days ago

How many important problems are there where a 3 month head start on the data side is enough to win permanently and retain your advantage in the long run?

I'm struggling to think of a scenario where "I have AGI in January and everyone else has it in April" is life-changing. It's a win, for sure, and it's an advantage, but success in business requires sustainable growth and manageable costs.

If (random example) the bargain OpenAI strikes is "we spend every cent of our available capital to get AGI 3 months before the other guys do" they've now tapped all the resources they would need to leverage AGI and turn it into profitable, scalable businesses, while the other guys can take it slow and arrive with full pockets. I don't think their leadership is stupid enough to burn all their resources chasing AGI but it does seem like operating and training costs are an ongoing problem for them.

History is littered with first-movers who came up with something first and then failed to execute on it, only for someone else to follow up and actually turn the idea into a success. I don't see any reason to assume that the "first AGI" is going to be the only successful AGI on the market, or even a success at all. Even if you've developed an AGI that can change the world you need to keep it running so it can do that.

Consider it this way: Sam Altman & his ilk have been talking up how dangerous OpenAI's technology is. Are risk-averse businessmen and politicians going to be lining up to put their livelihood or even their lives in the hands of "dangerous technology"? Or are they going to wait 3-6 months and adopt the "safe" AGI from somebody else instead?

vasco 584 days ago

Well that's the thought exercise. Is there something you can do with almost unlimited "brains" of roughly human capability but much faster, within a few days / weeks / months? Let's say you can instantiate 1 million agents, for 3 months, and each of them is roughly 100x faster than a human. That means you have the equivalent of 100 million humans working for those 3 months to dump into whatever you want. As long as your plans don't require building too many real world things that actually require moving atoms around, I think you could do some interesting things. You could potentially dump a few million hours into "better than AGI AI" to start off for example, then go to other things. If they are good enough you might be able to find enough zero-days to disable any adversary through software, among other interesting things.
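Spelling the arithmetic out, as a rough sketch: 1 million agents at 100x speed is equivalent to 100 million human-speed workers. Assuming they run around the clock for 3 months of 30 days each (an assumption, not stated in the comment), that comes to about 216 billion human-brain-hours.

```python
# Figures from the comment: 1 million agents, each ~100x human speed, for 3 months.
# HOURS_PER_MONTH assumes agents run 24 hours a day for 30-day months (an assumption).
AGENTS = 1_000_000
SPEEDUP = 100
MONTHS = 3
HOURS_PER_MONTH = 30 * 24

human_equivalents = AGENTS * SPEEDUP              # concurrent human-speed workers
wall_clock_hours = MONTHS * HOURS_PER_MONTH       # elapsed real time
human_brain_hours = human_equivalents * wall_clock_hours

print(f"{human_equivalents:,} human-equivalent workers")
print(f"{human_brain_hours:,} human-brain-hours")
```

So "a few million hours" on any one project would be a tiny fraction of the total budget in this scenario.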

kevingadd 584 days ago

Where does "almost unlimited" come into the picture though? I see people talking like AGI will be unlimited when it will be limited by available compute resources, and like I suggested, being 'first' might come at the cost of the war chest you'd need to access those resources.

What does it take to instantiate 1 million agents? Who has that kind of money and hardware? Would they still have it if they burn everything in the tank to be first?

vasco 583 days ago

> Where does "almost unlimited" come into the picture though

>> Like let's say you have a few datacenters of compute at your disposal and the ability to instantiate millions of AGI agents - what do you have them do?

> has that kind of money and hardware?

Any hyperscaler plus most geopolitical main players. So the ones who matter.

pertymcpert 583 days ago

Once you have AGI you use it to collect resources to cripple competitors and to build a snowball effect to make yourself unbeatable. 3 months of AGI is enough in the right hands to dominate the world economically.

acyou 583 days ago

That is why being #2 in technical product development can be great. Someone else pays to work out the kinks, then you copy what works and improve on it at a fraction of the cost. You see it time and time again.

dartos 584 days ago

I’m curious how, if at all, they plan to get around compounding bias in synthetic data generated by models trained on synthetic data.

ynniv 584 days ago

Everyone's obsessed with new training tokens... It doesn't need to be more knowledgeable, it just needs to practice more. Ask any student: practice is synthetic data.

dartos 584 days ago

That leads to overfitting in ML land, which hurts overall performance.

We know that unique data improves performance.

These LLM systems are not students…

Also, which students graduate and are immediately experts in their fields? Almost none.

It takes years of practice in unique, often one-off, situations after graduation for most people to develop the intuition needed for a given field.

ynniv 584 days ago

It's overfitting when you train too large a model on too many details. Rote memorization isn't rewarding.

The more concepts the model manages to grok, the more nonlinear its capabilities will be: we don't have a data problem, we have an educational one.

Claude 3.5 was safety trained by Claude 3.0, and it's more coherent for it. https://www.anthropic.com/news/claudes-constitution

dartos 584 days ago

Overfitting can be caused by a lot of different things. Having an over abundance of one kind of data in a training set is one of those causes.

It’s why many pre-processing steps for image training pipelines will add copies of images at weird rotations, amounts of blur, and different cropping.
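A minimal sketch of that kind of augmentation step, using plain Python lists as a toy grayscale image. The function names are hypothetical, and the 90-degree rotation and 3x3 box blur are simplifications chosen to keep the example dependency-free; real pipelines typically use an image library and arbitrary angles.

```python
import random

def rotate90(img):
    """Rotate a 2D grid of pixel values 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def box_blur(img):
    """Replace each pixel with the mean of its 3x3 neighbourhood (clipped at edges)."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(window) / len(window))
        out.append(row)
    return out

def random_crop(img, size, rng):
    """Cut a random size x size patch out of the image."""
    top = rng.randrange(len(img) - size + 1)
    left = rng.randrange(len(img[0]) - size + 1)
    return [row[left:left + size] for row in img[top:top + size]]

def augment(img, rng, crop_size=3):
    """Return several altered copies of one training image."""
    return [rotate90(img), box_blur(img), random_crop(img, crop_size, rng)]

rng = random.Random(0)
image = [[4 * y + x for x in range(4)] for y in range(4)]  # toy 4x4 "image"
variants = augment(image, rng)
```

Each original image yields several distinct training examples, so no single exact view of it is over-represented.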

> The more concepts the model manages to grok, the more nonlinear its capabilities will be

These kind of hand wavey statements like “practice,” “grok,” and “nonlinear its capabilities will be” are not very constructive as they don’t have solid meaning wrt language models.

So earlier when I was referring to compounding bias in synthetic data I was referring to a bias that gets trained on over and over and over again.

That leads to overfitting.

ynniv 584 days ago

> These kind of hand wavey statements like “practice,” “grok,” and “nonlinear its capabilities will be” are not very constructive as they don’t have solid meaning wrt language models.

So, here's my hypothesis, as someone who is adjacent to ML but hasn't trained DNNs directly:

We don't understand how they work, because we didn't build them. They built themselves.

At face value this can be seen as an almost spiritual position, but I am not a religious person and I don't think there's any magic involved. Unlike traditional models, the behavior of DNNs is based on random changes that failed up. We can reason about their structure, but only loosely about their functionality. When they get better at drawing, it isn't because we taught them to draw. When they get better at reasoning, it isn't because the engineers were better philosophers. Given this, there will not be a direct correlation between inputs and capabilities, but some arrangements do work better than others.

If this is the case, high order capabilities should continue to increase with training cycles, as long as they are performed in ways that don't interfere with what has been successfully learned. People lamented the loss of capability that GPT 4 suffered as they increased safety. I think Anthropic has avoided this by choosing a less damaging way to tune a well performing model.

I think these ideas are supported by Wolfram's reduction of the problem at https://writings.stephenwolfram.com/2024/08/whats-really-goi...

layer8 584 days ago

And who will tell the model whether its practice results are correct or not? Students practice against external evaluators, it’s not a self-contained system.

nialv7 584 days ago

synthetic data is fine if you can ground the model somehow. that's why the o1/o3's improvements are mostly in reasoning, maths, etc., because you can easily tell if the data is wrong or not.

dartos 584 days ago

That makes a lot of sense.

Binary success criteria has very little room for bias.

jsheard 584 days ago

> With o3 now public knowledge, imagine how long it’s been churning out new thinking at expert level across every field. OpenAI’s next moat may be the best synthetic training set ever.

Even taking OpenAI and the benchmark authors at their word they said that it is consuming at least tens of dollars per task to hit peak performance, how much would it cost to have it produce a meaningfully large training set?

qup 584 days ago

That's the public API price isn't it?

jsheard 584 days ago

There is no public API for o3 yet, those are the numbers they revealed in the ARC-AGI announcement. Even if they were public API prices we can't assume they're making a profit on those for as long as they're billions in the red overall every year. It's entirely possible that the public API prices are less than what OpenAI is actually paying.

noman-land 584 days ago

I completely don't understand the use for synthetic data. What good is it to train a model basically on itself?

psb217 584 days ago

The value of synthetic data relies on having non-zero signal about which generated data is "better" or "worse". In a sense, this is what reinforcement learning is about. I.e., generate some data, have that data scored by some evaluator, and then feed the data back into the model with higher weight on the better stuff and lower weight on the worse stuff.

The basic loop is: (i) generate synthetic data, (ii) rate synthetic data, (iii) update model to put more probability on better data and less probability on worse data, then go back to (i).
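The loop above can be sketched with a toy "model" that is just a table of logits over the numbers 0 to 9. Everything here is hypothetical and for illustration: the evaluator (closeness to 7), the update rule (nudging logits by score minus the batch average), and the hyperparameters are assumptions, not anyone's actual training method.

```python
import math
import random

def sample(logits, rng):
    """(i) Generate: draw one candidate with probability softmax(logits)."""
    items = list(logits)
    weights = [math.exp(logits[k]) for k in items]
    return rng.choices(items, weights=weights, k=1)[0]

def score(candidate):
    """(ii) Rate: hypothetical evaluator; higher means closer to the target 7."""
    return -abs(candidate - 7)

def train(steps=2000, batch=8, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = {n: 0.0 for n in range(10)}      # toy "model": uniform over 0..9
    for _ in range(steps):
        samples = [sample(logits, rng) for _ in range(batch)]
        scores = [score(s) for s in samples]
        baseline = sum(scores) / len(scores)
        # (iii) Update: raise logits of above-average samples, lower the others.
        for s, r in zip(samples, scores):
            logits[s] += lr * (r - baseline)
    return logits

logits = train()
best = max(logits, key=logits.get)
print("most probable output:", best)
```

The model never sees the "right answer" directly; it only sees which of its own samples scored better, which is the non-zero signal the comment refers to.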

RedNifre 584 days ago

But who rates the synthetic data? If it is humans, I can understand that this is another way to get human knowledge into it, but if it's rated by AI, isn't it just a convoluted way of copying the rating AI's knowledge?

recursivecaveat 584 days ago

Many things are more easily scored than produced. Like it's trivial to tell whether a poem rhymes, but writing one is a comparatively slow and difficult task. So hopefully since scoring is easier/more-discerning than generating, the idea is you can generate stuff, classify it as good or bad, and then retrain on the good stuff. It's kindof an article of faith for a lot of AI companies/professionals as well, since it prevents you from having to face a data wall, and is analogous to a human student practicing and learning in an appealing way.

As far as I know it doesn't work very well so far. It is prone to overfitting, where it ranks highly some trivial detail of the output eg "if a summary starts with a byline of the author it's a sign of quality" and then starts looping on itself over and over, increasing the frequency and size of bylines until it's totally crommed off to infinity and just repeating a short phrase endlessly. Humans have good baselines and common sense that these ML systems lack. If you've ever seen one of those "deep dream" images it's the same kind of idea. The "most possible dog" image can look almost nothing like a dog in the same way that the "most possible poem" may look nothing like a poem.

ijustlovemath 584 days ago

This is the bit I've never understood about training AI on its own output; won't you just regress to the mean?

astrange 583 days ago

It's not trained on its own output. You can generate infinite correctly worked out math traces and train on those.
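A toy sketch of that idea: worked traces whose correctness is guaranteed by ordinary arithmetic rather than by a model. The problem format and function names are hypothetical, and real pipelines presumably use far richer problems and a separate checker to filter model-written traces.

```python
import random

def make_trace(rng):
    """Build one multi-term addition problem with a step-by-step solution.

    The steps and answer are computed directly, not by a model,
    so every trace is correct by construction.
    """
    terms = [rng.randint(1, 99) for _ in range(rng.randint(2, 4))]
    steps, total = [], terms[0]
    for t in terms[1:]:
        steps.append(f"{total} + {t} = {total + t}")
        total += t
    question = " + ".join(map(str, terms)) + " = ?"
    return {"question": question, "steps": steps, "answer": total}

def check_trace(trace):
    """Independent verifier: re-evaluate the question and compare answers."""
    expr = trace["question"].split(" = ")[0]
    return sum(int(x) for x in expr.split(" + ")) == trace["answer"]

rng = random.Random(0)
dataset = [make_trace(rng) for _ in range(1000)]
print(dataset[0]["question"], dataset[0]["steps"], dataset[0]["answer"])
```

Because the generator can be run indefinitely, the supply of verified examples is limited only by compute, not by existing text.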

noman-land 584 days ago

Thanks, that makes a lot more sense.

viraptor 584 days ago

This is a good read for some examples https://arxiv.org/abs/2203.14465

> This technique, the "Self-Taught Reasoner" (STaR), relies on a simple loop: generate rationales to answer many questions, prompted with a few rationale examples; if the generated answers are wrong, try again to generate a rationale given the correct answer; fine-tune on all the rationales that ultimately yielded correct answers; repeat. We show that STaR significantly improves performance on multiple datasets compared to a model fine-tuned to directly predict final answers

But there are a few others. In general good data is good data. We're definitely learning more about how to produce good synthetic versions.
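A minimal sketch of one pass of the loop described in the quoted abstract. `toy_model` is a hypothetical stand-in for an LLM call (it adds two digits and sometimes slips by one), and the fine-tuning step is left out entirely; this only shows the generate, retry-with-answer, and filter logic.

```python
import random

def toy_model(question, rng, hint=None):
    """Hypothetical stand-in for an LLM: returns (rationale, answer).

    Without a hint it is off by one 40% of the time. Given the correct
    answer as a hint, it writes a rationale that reaches that answer.
    """
    a, b = question
    if hint is not None:
        return f"{a} + {b} gives {hint}", hint
    answer = a + b + (rng.choice([-1, 1]) if rng.random() < 0.4 else 0)
    return f"{a} + {b} gives {answer}", answer

def star_round(questions, answers, rng):
    """One STaR-style pass: keep only rationales that end in the right answer."""
    keep = []
    for q, gold in zip(questions, answers):
        rationale, pred = toy_model(q, rng)
        if pred != gold:
            # Wrong answer: try again, this time given the correct answer.
            rationale, pred = toy_model(q, rng, hint=gold)
        if pred == gold:
            keep.append((q, rationale, gold))
    return keep  # the paper's loop fine-tunes on these, then repeats

rng = random.Random(0)
questions = [(rng.randint(0, 9), rng.randint(0, 9)) for _ in range(20)]
answers = [a + b for a, b in questions]
finetune_set = star_round(questions, answers, rng)
print(len(finetune_set), "rationales kept for fine-tuning")
```

The key property is the filter: only rationales attached to a verified-correct answer make it into the next round's training data.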

im3w1l 584 days ago

One issue with that is that the model may learn to smuggle data. You as a human think that the plain reading of the words is what is doing the reasoning, but (part of) the processing is done by the exact comma placement and synonym choice etc.

Data smuggling is a known phenomenon in similar tasks.

viraptor 583 days ago

I don't think data smuggling is relevant in STaR-style scenarios. You're still validating the final output. If it works on test data, what could even be smuggled?

Majromax 583 days ago

> What good is it to train a model basically on itself?

If the model generates data of variable quality, and if there's a good way to distinguish good data from bad data, then training on self-generated data might "bootstrap" a model to better performance.

This is common in reinforcement learning. Famously, AlphaGo Zero (https://en.wikipedia.org/wiki/AlphaGo_Zero) learned exclusively on self-play, without reference to human-played games.

Of course, games have a built-in critic: the better strategy usually wins. It's much harder to judge the answer to a math problem, or decide which essay is more persuasive, or evaluate restaurant recommendations.

dyauspitr 583 days ago

If we get to a point where we have a model that, when fed a real world stream of data (YouTube, surveillance cameras, forum data, cell phone conversations etc.), can prune out a good training set for itself, then you’re at the point where the LLM is in a feedback loop where it can improve itself. That’s AGI for all intents and purposes.

nradov 584 days ago

There is an enormous "iceberg" of untapped non-public data locked behind paywalls or licensing agreements. The next frontier will be spending money and human effort to get access to that data, then transform it into something useful for training.

mistercheph 583 days ago

ah yes the beautiful iceberg of internal documentation, legal paperwork, and meeting notes.

the highest quality language data that exists is in the public domain