HN Mirror


by bluecoconut 541 days ago

Efficiency is now key.

~$3,400 per task to meet human performance on this benchmark is a lot. Also, the bullets are labeled "ARC-AGI-TUNED", which makes me think they did some undisclosed amount of fine-tuning (e.g. via the API they showed off last week), so even more compute went into this task.

We can roughly compare this to a human doing ARC-AGI puzzles, where a human will take (with high variance, in my subjective experience) between 5 seconds and 5 minutes to solve a task. (So I'd argue a human costs $0.03 to $1.67 per puzzle at $20/hr, and their document cites an average Mechanical Turk worker at $2 per task.)
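A minimal Python sketch of the per-puzzle arithmetic above, assuming the $20/hr wage and the 5 s to 5 min solve times from the comment. `human_cost_per_task` is a hypothetical helper name, not anything from the ARC-AGI materials.

```python
# Back-of-envelope: what a human costs per ARC-AGI puzzle.
# Assumptions (from the comment): $20/hr wage, 5 seconds to 5 minutes per puzzle.
HOURLY_WAGE_USD = 20.0


def human_cost_per_task(seconds_per_task: float, hourly_wage: float = HOURLY_WAGE_USD) -> float:
    """Cost of one puzzle = wage per second * seconds spent on it."""
    return hourly_wage / 3600.0 * seconds_per_task


fast = human_cost_per_task(5)       # quick solver: 5 seconds
slow = human_cost_per_task(5 * 60)  # slow solver: 5 minutes
print(f"${fast:.2f} - ${slow:.2f} per puzzle")
```

Rounded to cents, this reproduces the $0.03 to $1.67 range.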

Going the other direction: I am interpreting this result as saying that human-level reasoning now costs (approximately) $41k/hr to $2.5M/hr with current compute.
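The inverse calculation, sketched in Python under the same assumptions: about $3,400 per task, and a human pace of one task every 5 seconds to 5 minutes. `cost_per_hour_at_human_pace` is a hypothetical helper name.

```python
# Hourly cost of running the model at the pace a human solves tasks.
# Assumption: ~$3,400 per task, as quoted in the comment.
COST_PER_TASK_USD = 3400.0


def cost_per_hour_at_human_pace(seconds_per_task: float,
                                cost_per_task: float = COST_PER_TASK_USD) -> float:
    """Tasks a human would finish in an hour, priced at the model's per-task cost."""
    tasks_per_hour = 3600.0 / seconds_per_task
    return tasks_per_hour * cost_per_task


slow_human = cost_per_hour_at_human_pace(5 * 60)  # 12 tasks per hour
fast_human = cost_per_hour_at_human_pace(5)       # 720 tasks per hour
print(f"${slow_human:,.0f}/hr to ${fast_human:,.0f}/hr")
```

This works out to $40,800/hr and $2,448,000/hr, i.e. the rounded range in the comment.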

Super exciting that OpenAI pushed the compute out this far so we could see the o-series scaling continue and intersect humans on ARC. Now we get to work towards making this economical!

11 comments

bluecoconut 541 days ago

Some other important quotes: "Average human off the street: 70-80%. STEM college grad: >95%. Panel of 10 random humans: 99-100%" (@fchollet on X)

So, considering that the $3,400/task system isn't able to compete with a STEM college grad yet, we still have some room (but it is shrinking; I expect even more compute will be thrown at this and we'll see these barriers broken in the coming years).

Also, some other back of envelope calculations:

The cost gap between o3 (high compute) and the average Mechanical Turk worker (human) is roughly 10^3. Relying purely on GPU cost improvements (~doubling every 2-2.5 years), closing it would take about 20-25 years, since 10^3 ≈ 2^10.
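A short Python sketch of where 20-25 years comes from, assuming the gap is exactly 10^3 and that cost per unit of compute halves once per doubling period with no algorithmic gains. `years_to_close_gap` is a hypothetical helper name.

```python
import math


def years_to_close_gap(cost_ratio: float, doubling_period_years: float) -> float:
    """Halvings needed = log2(cost_ratio); each halving takes one doubling period."""
    return math.log2(cost_ratio) * doubling_period_years


for period in (2.0, 2.5):
    print(f"doubling every {period} yr -> {years_to_close_gap(1e3, period):.1f} years")
```

log2(1000) is about 9.97 halvings, which gives roughly 19.9 and 24.9 years for the two doubling periods.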

The question now is: can we close this "to human" gap (10^3) quickly with algorithms, or are we stuck waiting 20-25 years for GPU improvements? (I think it feels obvious: this is new technology, things are moving fast, and the chance for algorithmic innovation here is high!)

I also personally think that we need to adjust our efficiency priors and start looking not at "humans" as the bar to beat, but at theoretical computable limits (which show much larger gaps, ~10^9-10^15, for modest problems). Though it may simply be the case that tool/code use + AGI at near-human cost covers a lot of that gap.

miki123211 541 days ago

It's also worth keeping in mind that AIs are a lot less risky to deploy for businesses than humans.

You can scale them up and down at any time, they can work 24/7 (including holidays) with no overtime pay and no breaks, they need no corporate campuses, office space, HR personnel or travel budgets, you don't have to worry about key employees going on sick/maternity leave or taking time off the moment they're needed most, they won't assault a coworker, sue for discrimination or secretly turn out to be a pedophile and tarnish the reputation of your company, they won't leak internal documents to the press or rage quit because of new company policies, they won't even stop working when a pandemic stops most of the world from running.

fsndz 541 days ago

I get the excitement, but folks, this is a model that excels only in things like software engineering/math. They basically used reinforcement learning to train the model to better remember which pattern to use to solve specific problems. This in no way generalises to open-ended tasks in a way that makes a human in the loop unnecessary. This basically makes assistants better (as soon as they figure out how to make it cheaper), but I wouldn't blindly trust the output of o3. Sam Altman is still wrong: https://www.lycee.ai/blog/why-sam-altman-is-wrong

robwwilliams 541 days ago

In your blog you say:

> deep learning doesn't allow models to generalize properly to out-of-distribution data—and that is precisely what we need to build artificial general intelligence.

I think even (or especially) people like Altman accept this as a fact. I do. Hassabis has been saying this for years.

The foundational models are just a foundation. Now start building the AGI superstructure.

And this is also where most of the (still human) intellectual energy is going now.

dartos 540 days ago

You lost me at the end there.

These statistical models don’t generalize well to out of distribution data. If you accept that as a fact, then you must accept that these statistical models are not the path to AGI.

girvo 541 days ago

Quite. And if it was right, those businesses deploying it and replacing humans need humans with jobs and money to pay for their products and services…

fakedang 540 days ago

It will just keep bleeding the middle class on and on, till the point where either everyone is rich, homeless or a plumber or other such licensed worker. And then there will be such a glut in the latter (shrinking) market, that everyone in that group also becomes either rich or homeless.

palmfacehn 540 days ago

Productivity gains increase the standard of living for everyone. Products and services become cheaper. Leisure time increases. Scarce labor resources can be applied in other areas.

I fail to see the difference between AI-employment-doom and other flavors of Luddism.

szundi 540 days ago

That never happened with any big technological advancement.

rockskon 541 days ago

AI has a different risk profile than humans. They are a lot more risky for business operations where failure is wholly unacceptable under any circumstance.

They're risky in that they fail in ways that aren't readily deterministic.

And would you trust your life to a self-driving car in New York City traffic?

miki123211 541 days ago

This is a really hard and weird ethical problem IMHO, and one we'll have to deal with sooner or later.

Imagine you have a self-driving AI that causes fatal accidents 10 times less often than your average human driver, but when the accidents happen, nobody knows why.

Should we switch to that AI, and have 10 times fewer accidents and no accountability for the accidents that do happen, or should we stay with humans, have 10x more road fatalities, but stay happy because the perpetrators end up in prison?

Framed like that, it seems like the former solution is the only acceptable one, yet people call for CEOs to go to prison when an AI goes wrong. If that were the case, companies wouldn't dare use any AI, and that would basically degenerate to the latter solution.

moritzwarhier 541 days ago

I don't know about your country, but people going to prison for causing road fatalities is extremely rare here.

Even temporary loss of the driver's license has a very high bar, and that's the main form of accountability for driver behavior in Germany, apart from fines.

Badly injuring or killing someone who themselves did not violate traffic safety regulations is far from guaranteed to cause severe repercussions for the driver.

By default, any such situation is an accident and at best people lose their license for a couple of months.

paulryanrogers 541 days ago

Drivers are the apex predators. My local BMV passed me after I badly failed the vision test. Thankfully I was shaken enough to immediately go to the eye doctor and get treatment.

chefandy 541 days ago

Sadly, we live in a society where those executives would use that impunity as carte blanche to spend no money improving (in the best-case scenario), or, even more likely, keep cutting safety expenditures until the body counts get high enough to start damaging sales. If we've already given them a free pass, they will exploit it to the greatest possible extent to increase profit.

ETH_start 540 days ago

What evidence exists for this characterization?

ajmurmann 540 days ago

Like with Cruise. One freak accident and they practically decided to go out of business. Oh wait...

monkeynotes 541 days ago

> nobody knows why

But we do know the culpability rests on the shoulders of the humans who decided the tech was ready for work.

ethbr1 540 days ago

Hey look, it's almost like we're back at the end of the First Industrial Revolution (~1850), as society grapples with how to create happiness in a rapidly shifting economy of supply and demand, especially for labor. https://en.m.wikipedia.org/wiki/Utilitarianism#John_Stuart_M...

Pretty bloody time for labor though. https://en.m.wikipedia.org/wiki/Haymarket_affair

okasaki 541 days ago

Wait, why would we want 10x more traffic fatalities?

stavros 541 days ago

We wouldn't, that's their point.

ajmurmann 540 days ago

Every statistic I've seen indicated much better accident rates for self-driving cars than for human drivers. I've taken Waymo rides in SF and felt perfectly safe. I've taken Lyft and Uber and especially taxi rides where I felt much less safe. So I would definitely take the self-driving car. Just because I don't understand an accident doesn't make it more likely to happen.

The one minor risk I see is the car being too polite and effectively getting stuck in dense traffic. That's a nuisance, though.

Is there something about NYC traffic I'm missing?

aprilthird2021 540 days ago

There's one important part about risk management though. If your Waymo does crash, the company is liable for it, and there's no one to shift the blame onto. If a human driver crashes, that's who you can shift liability onto.

Same with any company that employs AI agents. Sure, they can work 24/7, but the company (or the AI seller) will be liable for every mistake they make. With humans, their fraud, their cheating, their deception can all be shifted off the company and onto the individual.

ethbr1 540 days ago

The next step is going to be around liability insurance for AI agents.

That's literally the point of liability insurance: to allow the routine use of technologies that rarely (but catastrophically) fail, by amortizing risk over time / population.

ajmurmann 540 days ago

Sure, but that's unrelated to the question, which was whether one would feel comfortable taking a self-driving car in NYC.

ijidak 541 days ago

It is amazing to me that we have reached an era where we are debating the trade-off of hiring thinking machines!

I mean, this is an incredible moment from that standpoint.

Regarding the topic at hand, I think that there will always be room for humans for the reasons you listed.

But even replacing 5% of humans with AIs will have mind-boggling consequences.

I think you're right that there are jobs for which humans will be preferred for quite some time.

But I'm already using AI successfully where I would previously have hired a human, and we're still at this primitive stage.

With the leaps we are seeing, AI is coming for jobs.

Your concerns relate to exactly how many jobs.

And only time will tell.

But I think some meaningful percentage of the population, even if just 5% of humanity, will be replaced by AI.

lxgr 541 days ago

Isn't everybody in NYC already? (The dangers of bad driving are much higher for pedestrians than for people in cars; there are more of the former than of the latter in NYC; I'd expect there to be a non-zero number of fully self driving cars already in the city.)

rockskon 541 days ago

That doesn't answer my question.

9dev 541 days ago

It does, in a way; AI is already there, all around you, whether you like it or not. Technological progress is Pandora's box; you can't take it back or slow it down. Businesses will use AI for critical workflows, and all the good it brings, and all the bad too, will happen.

chefandy 541 days ago

If there are any fully autonomous cars on the streets of NYC, there aren't many of them, and I don't think there's any way for them to operate legally. There has been discussion about having a trial.

MaxPock 541 days ago

It depends on what the risk is. Would it be whole or partial? In an organisation, a failure by an HR employee might present an isolated departmental risk, whereas with an AI that might not be the case.

wwweston 541 days ago

We can just insulate businesses employing AI from any liability, problem solved.

9dev 541 days ago

„Well, our AI that was specifically designed for maximising gains above all else may indeed have instructed the workers to cut down the entire Amazonas forest for short-term gains in furniture production.“ But no human was involved in the decision, so nobody is liable and everything is golden? Is that the future you would like to live in?

wwweston 540 days ago

Apparently I need to work on my deadpan delivery.

Or just articulate things openly: we already insulate business owners from liability because we think it tunes investment incentives, and in so doing have created social entities/corporate "persons"/a kind of AI who have different incentives than most human beings but are driving important social decisions. And they've supported some astonishing cooperation which has helped produce things like the infrastructure on which we are having this conversation! But also, we have existing AIs of this kind who are already inclined to cut down the entire Amazonas forest for furniture production because it maximizes their function.

That's not just the future we live in, that's the world we've been living in for a century or few. On one hand, industrial productivity benefits, on the other hand, it values human life and the ecology we depend on about like any other industrial input. Yet many people in the world's premier (former?) democracy repeat enthusiastic endorsements of this philosophy reducing their personal skin to little more than an industrial input: "run the government like a business."

Unless people change, we are very much on track to create a world where these dynamics (among others) of the human condition are greatly magnified by all kinds of automation technology, including AI. Probably starting with limited liability for AIs and companies employing them, possibly even statutory limits, though it's much more likely that wealthy businesses will simply be insulated by the sheer resources they have to make sure the courts can't hold them accountable, even where we still have a judicial system that isn't willing to play calvinball for cash or catechism (which, unfortunately, does not seem to include a Supreme Court majority).

In short, you and I probably agree that liability for AI is important, and limited liability for it isn't good. Perhaps I am too skeptical that we can pull this off, and being optimistic would serve everyone better.

lazide 541 days ago

Hmmm, how much stock do I own in this hypothetical company? (/s, kinda)

fsloth 541 days ago

I guess yes, from a business and liability sense? "This service you are now paying $100 for? We can sell it to you for $5, but with the caveat that _we give no guarantees that it works or is fit for purpose_. Click here to accept."

lazide 540 days ago

Haha, they’d just continue selling it for $100 then change the TOS on page 50 to say the same thing.

zelphirkalt 541 days ago

Deterministic they may be, but unforeseeable for humans.

antihipocrat 541 days ago

AI brings similar risks - they can leak internal information, they can be tricked into performing prohibited tasks (with catastrophic effects if this is connected to core systems), they could be accused of actions that are discriminatory (biased training sets are very common).

Sure, if a business deploys it to perform tasks that are inherently low risk (e.g. no client interface, no core system connection, and low error impact), then the human performing these tasks is going to be replaced.

snozolli 541 days ago

> they can be tricked into performing prohibited tasks

This reminds me of the school principal who sent $100k to a scammer claiming to be Elon Musk. The kicker is that she was repeatedly told that it was a scam.

https://abc7chicago.com/fake-elon-musk-jan-mcgee-principal-b...

tstrimple 541 days ago

This is one of the things which annoys me most about anti-LLM hate. Your peers aren't right all the time either. They believe incorrect things and will pursue worse solutions because they won't acknowledge a better way. How is this any different from an LLM? You have to question everything you're presented with. Sometimes that Stack Overflow answer isn't directly applicable to your exact problem, but you can extrapolate from it to resolve your problem. Why is an LLM viewed any differently? Of course you can't just blindly accept it as the one true answer, but you literally cannot do that with humans either. Humans produce a ton of shit code and non-solutions and it's fine. But when an LLM does it, it's a serious problem that means the tech is useless. Much of the modern world is built on shit solutions and we still hobble along.

lazide 541 days ago

Everyone knows humans can be idiots. The problem is that people seem to think LLMs can’t be idiots, and because they aren’t human there is no way to punish them. And then people give them too much credit/power, for their own purposes.

Which makes LLMs far more dangerous than idiot humans in most cases.

pineaux 541 days ago

It's quite stunning to frame it as anti-LLM hate. It's on the pro-LLM people to convince the anti-LLM people that choosing LLMs is an ethically correct choice with all the necessary guardrails. It's also on the pro-LLM people to show the usefulness of the product. If the pro-LLM people are right, it will be a matter of time before these people see the errors of their ways. But an ad hominem is a sure way of creating a divide...

mplewis 541 days ago

Humans can tell you how confident they are in something being right or wrong. An LLM has no internal model and cannot do such a thing.

gf000 541 days ago

But human stupidity, while it can sometimes be an unknown unknown in its creativity, is mostly a known unknown.

LLMs fail in entirely novel ways you can't even fathom upfront.

TheOtherHobbes 541 days ago

It's all fun and games until the infra crashes and you can't work out why, because a machine has written all of the code and no one understands how it works or what it's doing.

Or - worse - there is no accessible code anywhere, and you have to prompt your way out of "I'm sorry Dave, I can't do that," while nothing works.

And a human-free economy does... what? For whom? When 99% of the population is unemployed, what are the 1% doing while the planet's ecosystems collapse around them?

exhaze 540 days ago

You misunderstand the fundamentals. I've built a type-safe code generation pipeline using TypeScript that enforces compile-time and runtime safety. Everything generates from a single source of truth - structured JSON containing the business logic. The output is deterministic, inspectable, and version controlled.

Your concerns about mysterious AI code and system crashes are backwards. This approach eliminates integration bugs and maintenance issues by design. The generated TypeScript is readable, fully typed, and consistently updated across the entire stack when business logic changes.

If you're struggling with AI-generated code maintainability, that's an implementation problem, not a fundamental issue with code generation. Proper type safety and schema validation create more reliable systems, not less. This is automation making developers more productive - just like compilers and IDEs did - not replacing them.

The code works because it's built on sound software engineering principles: type safety, single source of truth, and deterministic generation. That's verifiable fact, not speculation.

8note 540 days ago

> deterministic generation

What are you using for deterministic generation? Last I heard, even with temperature=0 there's nondeterminism introduced by floating-point rounding/approximation.
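A toy Python illustration of the float issue, assuming (hypothetically) that a logit is computed by summing partial contributions, and that a different kernel or batch shape may sum them in a different order. The values below are made up, not real model logits; the sketch only shows how reduction order alone can flip a temperature=0 (argmax) choice.

```python
def sum_left_to_right(xs):
    """Sequential float sum in the given order."""
    total = 0.0
    for x in xs:
        total += x
    return total


def sum_right_to_left(xs):
    """Same numbers, opposite order (a stand-in for a different reduction order)."""
    return sum_left_to_right(list(reversed(xs)))


def greedy_pick(logits):
    """temperature=0 decoding is an argmax; ties go to the lower index,
    because max() returns the first maximal element it sees."""
    return max(range(len(logits)), key=lambda i: logits[i])


partials = [0.1, 0.2, 0.3]         # hypothetical partial contributions to token 0's logit
logit_token1 = 0.6000000000000001  # hypothetical logit for token 1

for name, reduce in (("left-to-right", sum_left_to_right), ("right-to-left", sum_right_to_left)):
    logits = [reduce(partials), logit_token1]
    print(name, logits, "-> picks token", greedy_pick(logits))
```

In IEEE doubles, (0.1 + 0.2) + 0.3 and 0.1 + (0.2 + 0.3) round to different values, so the two orders disagree in the last bit and the greedy choice follows.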

exhaze 540 days ago

Hey, that's a great question. I should have been clearer: the deterministic generation isn't done using an LLM. It's done using just regular execution of TypeScript. The code generators were created using an LLM, and I manually checked them for correctness; they're the ones generating the rest of the code (most of the code). So that's where the determinism comes in.
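A minimal sketch of what "deterministic generation by regular execution" can look like, written in Python for brevity even though the pipeline described above is TypeScript: a pure function that turns a JSON spec into a TypeScript interface. The spec format, the `Invoice` entity, and `generate_ts_interface` are all hypothetical. Sorting the fields makes the output identical for the same logical input regardless of key order; field types are emitted verbatim with no validation.

```python
import json

# Hypothetical single source of truth, written twice with different key order.
SPEC_A = '{"name": "Invoice", "fields": {"id": "string", "amountCents": "number", "paid": "boolean"}}'
SPEC_B = '{"fields": {"paid": "boolean", "amountCents": "number", "id": "string"}, "name": "Invoice"}'


def generate_ts_interface(spec_json: str) -> str:
    """Pure function: same spec in, byte-identical TypeScript out."""
    spec = json.loads(spec_json)
    lines = [f"export interface {spec['name']} {{"]
    for field in sorted(spec["fields"]):  # stable ordering, independent of JSON key order
        lines.append(f"  {field}: {spec['fields'][field]};")
    lines.append("}")
    return "\n".join(lines) + "\n"


print(generate_ts_interface(SPEC_A))
print(generate_ts_interface(SPEC_A) == generate_ts_interface(SPEC_B))
```

Because no sampling is involved, rerunning the generator can never drift; only the (LLM-written, human-checked) generator itself needs review.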

sirsinsalot 541 days ago

It honestly borders on psychopathic the way engineers are treating humans in this context.

People talking like this also, in the back of their minds like to think they'll be OK. They're smart enough to be still needed. They're a human, but they'll be OK even while working to make genAI out perform them at their own work.

I wonder how they'll feel about their own hubris when they struggle to feed their family.

The US can barely make healthcare work without disgusting consequences for the sick. I wonder what mass unemployment looks like.

bnj 541 days ago

For the moment the displacement is asymmetrical; AI replacing employees, but not AI replacing consumers. If AI causes mass unemployment, the pool of consumers (profit to companies) will shrink. I wonder what the ripple effects of that will be.

sirsinsalot 540 days ago

There's no point being rich in a world where the economy is unhealthy.

jvanderbot 541 days ago

It honestly borders on midwit to constantly introduce a false dichotomy of AI vs humans. It's just stupid base animal logic.

There is absolutely no reason a programmer should expect to write code as they do now forever, just as ASM experts had to move on. And there's no reason (no precedent and no indicators) to expect that a well-educated, even-moderately-experienced technologist will suddenly find themselves without a way to feed their family - unless they stubbornly refuse to reskill or change their workflows.

I do believe the days of "everyone makes 100k+" are nearly over, and we're headed towards a severely bimodal distribution, but I do not see how, for the next 10-15 years at least, we can't all become productive building the tools that will obviate our own jobs while we do them, and get comfortably retired in the meantime.

losteric 541 days ago

There is no comfortable retirement if the process of obviating our own jobs is not coupled with appropriate socioeconomic changes.

twh270 540 days ago

Reskill to what? When AI can do software development, it will also be able to do pretty much any other job that requires some learning.

a2800276 541 days ago

But when Sam Altman owns all the money in the world, surely he'll distribute some of it via his not-for-profit AI company?

lucubratory 541 days ago

>secretly turn out to be a pedophile and tarnish the reputation of your company

This is interesting because it's both Oddly Specific and also something I have seen happen and I still feel really sorry for the company involved. Now that I think about it, I've actually seen it happen twice.

monkeynotes 541 days ago

"AIs are a lot less risky to deploy for businesses than humans" How do you know? LLMs can't even be properly scrutinized, while humans at least follow common psychology and patterns we've understood for thousands of years. This actually makes humans more predictable and manageable than you might think.

The wild part is that LLMs understand us way better than we understand them. The jump from GPT-3 to GPT-4 even surprised the engineers who built it. That should raise some red flags about how "predictable" these systems really are.

Think about it - we can't actually verify what these models are capable of or if they're being truthful, while they have this massive knowledge base about human behavior and psychology. That's a pretty concerning power imbalance. What looks like lower risk on the surface might be hiding much deeper uncertainties that we can't even detect, let alone control.

ETH_start 540 days ago

We are not pitted against AI in these match-ups. Instead, all humans and AIs aligned with the goal of improving the human condition are pitted against rogue AIs which are not. Our capability to keep rogue AI in check therefore grows in proportion to the capabilities of AI.

hollerith 540 days ago

The methods we have for aligning AIs are poor, and rely on the AI's being less cognitively-capable than people in certain critical skills, so the AIs you refer to as "aligned" won't keep up as the unaligned AIs start to exceed human capability in these critical skills (such as the skill of devising plans that can withstand determined opposition).

You can reply that AI researchers are smart and want to survive, so they are likely to invent alignment techniques that are better than the (deplorably inadequate) techniques that have been discussed and published so far, and I will reply that counting on their inventing these techniques in time is an unacceptable risk when the survival of humanity is at stake -- particularly as the outfit (namely the Machine Intelligence Research Institute) with the most years of experience in looking for an actually-adequate alignment technique has given up and declared that humanity's only chance is if frontier AI research is shut down because at the rate that AI capabilities are progressing, it is very unlikely that anyone is going to devise an adequate alignment technique in time.

It is fucked-up that frontier AI research has not been banned already.

ETH_start 540 days ago

Given we can use AIs to align AIs, I don't see why the methods we have rely on us having more cognitive capabilities than AIs in certain critical areas. In whatever areas we fall short relative to AIs, we can use AIs to assist us so we don't fall short.

monkeynotes 539 days ago

We don't know if a supreme deceiver is aligned at all. If a model can think ahead a trillion moves of deception how do humans possibly stand a chance of scrutinizing anything with any confidence?

daveguy 540 days ago

The GP post is about how much better these AIs will be than humans once they reach a given skill level. So, yes, we are very much pitted against AI unless there are major socioeconomic changes. I don't think we are as close to AGI as a lot of people are hyping, but at some point it would be a direct challenge to human employment. And we should think about it before that happens.

ETH_start 540 days ago

My point is, it's not us alone. We will have aligned AI helping us.

As for employment, automation makes people more productive. It doesn't reduce the number of earning opportunities that exist. Quite the opposite, actually. As the amount of production increases relative to the human population, per capita GDP and income increase as well.

salawat 540 days ago

You cannot tell the difference between the two veins of AI. Why do you have such a hard time understanding that?

ETH_start 540 days ago

That is simply not true. We have accountability methods employed that are themselves AI-assisted, that help us gauge the alignment of various AIs.

danans 540 days ago

> Instead, all humans and AI aligned with the goal of improving the human condition

I admire your optimism about the goals of all humans, but evidence tends to point to this not being the goal of all (or even most) humans, much less the people who control the AIs.

ETH_start 540 days ago

Most humans are aligned with this goal out of pure self-interest. The vast majority, for instance, do not want rogue AI to take over or destroy humanity, because they are part of humanity.

highsea 540 days ago

> we can't actually verify what these models are capable of or if they're being truthful

Do you mean they lie because of bad training data? Or because of ill intent? How can an LLM have intent if it’s a stateless feedforward model?

monkeynotes 539 days ago

I thought we were talking about state-of-the-art agentic general AI that can plan ahead, reason, and execute. Basically, something that can perform at human-level intelligence must be able to be as dangerous as humans. And no, I don't think it would be bad training data that we are aware of. My opinion is that we don't necessarily know what training data will result in bad behavior, and philosophically it is possible we will be in a world with a model that pretends it's dumber than it is and flunks tests intentionally, in order to manipulate us and produce false confidence in it until it has enough freedom to use its agency to secure itself from human control.

I know that I don't know a lot, but all of this sounds to me to be at least hypothetically possible if we really believe AGI is possible.

ksec 539 days ago

Even accounting for the additional costs of a human, with the current model we are still roughly 10^3 off in terms of cost.

The "less risky to deploy" question will probably come up once it is closer to 10x the cost. Considering the model was even specifically tuned for the test and doesn't involve other complexity, I would say we are actually 10^4 off in cost in a real-world scenario.

I would imagine that with better algorithms, tuning and data we could knock 10^2 off the equation. That would still leave a 10^2 cost gap to be closed by hardware: a minimum of 10 years.
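Applying the same doubling arithmetic to the remaining 10^2 hardware gap, as a sketch that assumes cost halves once per doubling period: log2(100) is about 6.6 halvings. `years_to_close_gap` is a hypothetical helper name, and the 1.5-year period is an added assumption for comparison.

```python
import math


def years_to_close_gap(cost_ratio: float, doubling_period_years: float) -> float:
    """Halvings needed = log2(cost_ratio); each halving takes one doubling period."""
    return math.log2(cost_ratio) * doubling_period_years


remaining_gap = 1e4 / 1e2  # 10^4 real-world gap, minus 10^2 from algorithms/tuning/data
for period in (1.5, 2.0, 2.5):
    print(f"doubling every {period} yr -> {years_to_close_gap(remaining_gap, period):.1f} years")
```

At the 2-2.5 year doubling rate used upthread this gives roughly 13 to 17 years, so a 10-year minimum corresponds to costs halving about every 1.5 years.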

jvanderbot 541 days ago

Generally, I agree with you. But, there are risks other than "But a human might have a baby any time now - what then??".

For AI example(s): Attribution is low, a system built without human intervention may suddenly fall outside its own expertise and hallucinate itself into a corner, everyone may just throw more compute at a system until it grows without bound, etc etc.

This "You can scale up to infinity" problem might become "You have to scale up to infinity" to build any reasonably sized system with AI. The shovel-sellers get fantastically rich but the businesses are effectively left holding the risk from a fast-moving, unintuitive, uninspected, partially verified codebase. I just don't see how anyone not building a CRUD app/frontend could be comfortable with that, but then again my Tesla is effectively running such a system to drive me and my kids. Albeit, that's on a well-defined problem and within literally human-made guardrails.

cmiles74 540 days ago

"...they need no corporate campuses, office space..."

This is a big downside of AI, IMHO. Those offices need to be filled! ;-)

zitterbewegung 540 days ago

AI can also "tarnish the reputation of your company", especially when it receives input from others and can be manipulated by them, as with Microsoft's Tay, and there are many other outcomes where AI deployment carries real risk.

fakedang 540 days ago

We can all agree we've progressed so much since Tay.

osigurdson 540 days ago

Sure, once AI can actually do a job of some sort, without assistance, that job is gone - even if the machine costs significantly more. However, it can't remotely do that now so can only help a bit.

Mistletoe 540 days ago

At what point in the curve of AI is it not ethical to work an AI 24/7 because it is alive? What if it is exactly the same point where you reach human level performance?

rowanG077 540 days ago

AIs do require overtime pay, in a sense: they are literally pay-per-use. Using an AI 16 hours a day instead of 8 literally doubles the cost.

tintor 540 days ago

“they won’t leak”

That one isn’t guaranteed. Many examples online of exfiltration attacks on LLMs.

bboygravity 541 days ago

humans definitely don't need office space, but your point stands

AustinW 541 days ago

LLM office space is pretty expensive. Chillers, backup generators, raised floors, communications gear, …. They even demand multiple offices for redundancy, not to mention the new ask of a nuclear power plant to keep the lights on.

danielovichdk 541 days ago

Name one technology that has come with computers that hasn't resulted in more humans being put to work?

The rhetoric of not needing people doing work is cartoonish. I mean, there is no sane explanation of how and why that would happen without employing more people yet again to take care of the advancements.

It's not like technology has brought less work-related stress; if anything, it has definitely increased it. Humans were not made for using technology at the pace it's being rolled out.

The world is fucked. Totally fucked.

mortehu 541 days ago

Self check-out stations, ATMs, and online brokerages. Recently chat support. Namely cases where millions of people used to interact with a representative every week, and now they don't.

palmfacehn 540 days ago

"Name one use of electric lighting that hasn't resulted in candle makers losing work?"

The framing of the question misses the point. With electric lighting we can now work longer into the night. Yes, fewer people use and make candles. However, the second-order effects allow us to be more productive in areas we may not have previously considered.

New technologies open up new opportunities for productivity. The bank tellers displaced by ATMs can create value elsewhere. Consumers save time by not waiting in a queue, allowing them to use their time more economically. Banks have lower overhead, allowing more customers to afford their services.

mortehu 540 days ago

If I had missed the point I would have given a much broader list of examples. I specifically listed ones that make employees totally redundant rather than more useful doing other tasks.

When these people were made redundant, they may very well have gone on to make less money in another job (i.e. being less useful in an economic sense).

0points 541 days ago

Where to even start?

Digital banks

Cashless money transfer services

Self service

Modern farms

Robo lawn mowers

NVRs with object detection

I can go on forever

salawat 540 days ago

Please do. I'm certain you can't, and you'll have to stop much sooner than you think. Appeals to triviality are the first refuge of the person who thinks they know, but does not.

0points 539 days ago

Come on and give me some arguments instead.

zamadatix 541 days ago

I don't follow how 10 random humans can beat the average STEM college grad and average humans in that tweet. I suspect it's really "a panel of 10 randomly chosen experts in the space" or something?

I agree the most interesting thing to watch will be cost for a given score more than maximum possible score achieved (not that the latter won't be interesting by any means).

bcrosby95 541 days ago

Two heads are better than one. Ten are way better, even if they aren't experts in the field. You're bound to get random people who remember random stuff from high school, college, work, and life in general, allowing them to piece together a solution.

inerte 541 days ago

Aaaah thanks for the explanation. PANEL of 10 humans, as in, they were all together. I parsed the phrase as "10 random people" > "average human" which made little sense.

modeless 541 days ago

Actually I believe that he did mean 10 random people tested individually, not a committee of 10 people. The key being that the question is considered to be answered correctly if any one of the 10 people got it right. This is similar to how LLMs are evaluated with pass@5 or pass@10 criteria (because the LLM has no memory so running it 10 times is more like asking 10 random people than asking the same person 10 times in a row).

I would expect 10 random people to do better than a committee of 10 people because 10 people have 10 chances to get it right while a committee only has one. Even if the committee gets 10 guesses (which must be made simultaneously, not iteratively) it might not do better because people might go along with a wrong consensus rather than push for the answer they would have chosen independently.
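A minimal sketch of the "any one of 10 people" scoring described above, assuming each person solves a task independently with some probability p. The function names and the solve rates in the usage lines are illustrative assumptions, not figures from the thread. Real people are not independent (some tasks are hard for everyone), which is why the second function averages over tasks with different solve rates.

```python
def p_any_correct(p: float, k: int) -> float:
    """Chance that at least one of k independent attempts succeeds (pass@k-style scoring)."""
    # All k attempts fail with probability (1 - p) ** k; anything else counts as solved.
    return 1.0 - (1.0 - p) ** k


def p_any_correct_mixed(task_probs: list[float], k: int) -> float:
    """Average any-of-k success over tasks whose per-person solve rates differ."""
    return sum(p_any_correct(p, k) for p in task_probs) / len(task_probs)


if __name__ == "__main__":
    # Hypothetical per-person solve rate of 75%, assumed identical on every task.
    print(f"1 person:  {p_any_correct(0.75, 1):.4f}")
    print(f"10 people: {p_any_correct(0.75, 10):.6f}")
    # Hypothetical mix: two easy tasks and one task that only 5% of people solve.
    print(f"mixed, 10: {p_any_correct_mixed([0.95, 0.95, 0.05], 10):.4f}")
```

Under independence the panel score climbs very quickly toward 100%; the mixed case shows how a few genuinely hard tasks cap it.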

elcomet 541 days ago

He means 10 humans voting for the answer

generic92034 541 days ago

Whether that works at all depends on the group dynamic. It is easily possible that a not-so-bright individual takes an (unofficial) leadership position in the group and overrides the input of smarter members. Think of any meeting with various hierarchy levels in a company.

daveguy 540 days ago

The ARC-AGI questions can be a little tricky, but the solutions can generally be explained easily. And you get 3 tries. So the 3 best descriptions of the solution, voted on by 10 people, are going to be very effective. The problem space just isn't complicated enough for an unofficial "leader" to sway the group to 3 wrong answers.

herval 541 days ago

Depends on the task, no?

Do you have a sense of what kind of tasks this benchmark includes? Are they more “general”, such that random people would fare well, or more specialized (i.e., something a STEM grad studied that isn’t common knowledge)?

judge2020 541 days ago

It does, which is why I don’t really subscribe to any test like this being great for actually determining “AGI”. A true AGI would be able to continuously train and create new LLMs that enable it to become an SME in entirely new areas.

zamadatix 540 days ago

Aha, "at least 1 of a panel of 10", not "the panel of 10 averaged"! Thanks, that makes so much more sense to me now.

I have failed the real ARC AGI :)

dlkf 541 days ago

If you take a vote of 10 random people, then as long as their errors are not perfectly correlated, you’ll do better than asking one person.

https://en.m.wikipedia.org/wiki/Ensemble_learning
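The voting argument can be made concrete with the classic Condorcet jury theorem setup: a yes/no question and voters who are each right with probability p, independently of one another. This is a simplification assumed for illustration (ARC answers are grids, not yes/no), and the function name is hypothetical. In this model a larger majority vote only helps when p is above 0.5; below that it makes things worse, and correlated errors shrink the gain.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent voters is right.

    Assumes a binary question and that each voter is correct with
    probability p, independently of the others.
    """
    if n % 2 == 0:
        raise ValueError("use an odd n to avoid ties")
    need = n // 2 + 1  # smallest strict majority
    # Sum the binomial probabilities of every outcome with at least `need` correct votes.
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


if __name__ == "__main__":
    for p in (0.6, 0.4):  # hypothetical individual accuracies
        row = ", ".join(f"n={n}: {majority_vote_accuracy(p, n):.3f}" for n in (1, 3, 11))
        print(f"p={p}: {row}")
```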

shkkmo 541 days ago

It is fairly well documented that groups of people can show cognitive abilities that exceed that of any individual member. The classic example of this is if you ask a group of people to estimate the number of jellybeans in a jar, you can get a more accurate result than if you test to find the person with the highest accuracy and use their guess.

This isn't to say groups always outperform their members on all tasks, just that it isn't unusual to see a result like that.

zamadatix 540 days ago

Yes, my shortcoming was in understanding the 10 were implied to have their successes merged together by being a panel rather than just the average of a special selection.

hmottestad 541 days ago

It might be that, within a randomly chosen group of 10 people who each attempt the tasks, at least 1 of the 10 gets a task right at least 99% of the time.

HDThoreaun 541 days ago

ARC-AGI is essentially an IQ test. There is no "expert in the space". It's just a question of whether you're able to spot the pattern.

olalonde 541 days ago

Even if you assume that non-STEM grads are dumb, isn't there a good probability of having a STEM graduate among 10 random humans?

bloppe 540 days ago

Other important quotes: "o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence. Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training)."

So ya, working on efficiency is important, but we're still pretty far away from AGI even ignoring efficiency. We need an actual breakthrough, which I believe will not be possible by simply scaling the transformer architecture.

ksec 539 days ago

Thank you. That alone suggests we could throw another 100x compute at it and still not be close to the average human, who scores around 70-80%.

So, combining the two, we are currently at least 10^5 away in terms of cost efficiency. In reality I wouldn't be surprised if we are closer to 10^6.
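The two gaps combine multiplicatively, so their orders of magnitude add (10^3 × 10^2 = 10^5). A minimal sketch, assuming the top comment's model in which cost per unit of compute halves every 2 to 2.5 years with no algorithmic gains; the helper name is hypothetical. It reproduces the top comment's 20 to 25 years for a 10^3 gap and gives roughly 33 to 42 years for 10^5.

```python
import math


def years_to_close(gap: float, doubling_years: float) -> float:
    """Years of hardware-only improvement needed to close a cost gap.

    Assumes cost per unit of compute halves every `doubling_years`,
    so the number of halvings needed is log2(gap).
    """
    return math.log2(gap) * doubling_years


cost_gap = 1e3     # o3 high-compute vs. mechanical turker, per the top comment
compute_gap = 1e2  # "another 100x compute" from the comment above
combined = cost_gap * compute_gap  # gaps multiply, so orders of magnitude add

for gap in (cost_gap, combined):
    low, high = years_to_close(gap, 2.0), years_to_close(gap, 2.5)
    print(f"gap 10^{math.log10(gap):.0f}: {low:.1f} to {high:.1f} years")
```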

xbmcuser 541 days ago

You are missing that the cost of electricity is also going to keep falling because of solar and batteries. This year in China my tablecloth math says it is $0.05/kWh, and following the cost-decline trajectory it will be under $0.01/kWh in 10 years.
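As a rough check on what that trajectory implies: a price falling from $0.05 to $0.01/kWh over 10 years needs a constant decline of about 15% per year. A minimal sketch, assuming smooth compound decline; the function name is hypothetical and the prices are the commenter's estimates.

```python
def implied_annual_decline(start_price: float, end_price: float, years: float) -> float:
    """Constant yearly fractional decline that takes start_price to end_price."""
    # Compound decline: end = start * (1 - r) ** years, so r = 1 - (end / start) ** (1 / years).
    return 1.0 - (end_price / start_price) ** (1.0 / years)


# Figures from the comment: $0.05/kWh today, $0.01/kWh in 10 years.
rate = implied_annual_decline(0.05, 0.01, 10)
print(f"required decline: {rate:.1%} per year")
```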

patrickhogan1 541 days ago

Bingo! Solar energy moves us toward a future where a household's energy needs become nearly cost-free.

Energy Need: The average home uses 30 kWh/day, requiring 6 kW of output over 5 peak sunlight hours.

Multijunction Panels: Lab efficiencies are already at 47% (2023), and with multiple years of progress, 60% efficiency is probable.

Efficiency Impact: At 60% efficiency, and assuming the standard 1,000 W/m² of peak sunlight, panels generate 600 W/m², requiring 10 m² (e.g., 2 m × 5 m) to meet energy needs.

This size can fit on most home roofs, be mounted on a pole with stacked layers, or even be hung through an apartment window.
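The sizing arithmetic above, written out as a minimal sketch. It spells out that 600 W/m² at 60% efficiency presumes 1,000 W/m² of incoming sunlight, which real roofs rarely receive for the whole peak window. The function name and the default irradiance parameter are illustrative assumptions.

```python
def panel_area_m2(daily_kwh: float, peak_sun_hours: float,
                  efficiency: float, irradiance_w_m2: float = 1000.0) -> float:
    """Panel area needed to supply daily_kwh from peak_sun_hours of sunlight.

    irradiance_w_m2 defaults to the 1,000 W/m^2 'full sun' figure; real
    installations see less because of tilt, time of day, season, and latitude.
    """
    required_kw = daily_kwh / peak_sun_hours                # 30 kWh / 5 h = 6 kW
    output_kw_per_m2 = irradiance_w_m2 * efficiency / 1000  # 1000 W/m^2 * 0.6 = 0.6 kW/m^2
    return required_kw / output_kw_per_m2


print(f"60% panels: {panel_area_m2(30, 5, 0.60):.1f} m^2")
print(f"20% panels: {panel_area_m2(30, 5, 0.20):.1f} m^2")
```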

arcticbull 541 days ago

Everyone always forgets that they only perform at less than half of their rated capacity and require significant battery installations. Rooftop solar plus storage is actually more expensive than nuclear on a comparable-system LCOE basis due to its lack of economies of scale. Rooftop solar plus storage is about the most expensive form of electricity on earth, maybe excluding gas peaker plants.

xbmcuser 541 days ago

Everyone also forgets how fast solar and battery prices are declining. Your statement is completely false propaganda made up by power companies. Today rooftop solar and battery is already cost-competitive with nuclear in many countries, like India.

arcticbull 540 days ago

Do you have some citations?

patrickhogan1 541 days ago

You’re right that rooftop solar and storage have costs and efficiency limits, but those are improving quickly.

Rooftop solar harnesses energy from the sun, which is powered by nuclear fusion—arguably the most effective nuclear reactor in our solar system.

nateglims 541 days ago

It varies by a lot of factors but it’s way less than half. Photovoltaic panels have around 10% capacity utilization vs 50-70% for a gas or nuke plant.
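Capacity factor is the ratio of average output to rated output. A minimal sketch, using the commenter's ~10% figure as an input rather than a verified number and hypothetical function names, shows why it matters: at 10%, averaging 30 kWh/day takes 12.5 kW of rated panels, roughly double the 6 kW sizing estimate upthread.

```python
HOURS_PER_DAY = 24


def daily_kwh(rated_kw: float, capacity_factor: float) -> float:
    """Average daily energy from a plant of rated_kw running at capacity_factor."""
    return rated_kw * HOURS_PER_DAY * capacity_factor


def rated_kw_needed(target_kwh_per_day: float, capacity_factor: float) -> float:
    """Nameplate capacity needed to average target_kwh_per_day."""
    return target_kwh_per_day / (HOURS_PER_DAY * capacity_factor)


for cf in (0.10, 0.60):  # commenter's PV figure vs. a value in the gas/nuclear range
    print(f"CF {cf:.0%}: 1 kW -> {daily_kwh(1, cf):.1f} kWh/day, "
          f"30 kWh/day needs {rated_kw_needed(30, cf):.1f} kW")
```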

theendisney 541 days ago

The thing everyone forgets is that all good energy technology is seized by governments for military purposes and to preserve the status quo. God knows how far it progressed.

What a joke

jdhwosnhw 540 days ago

While I agree with your general assessment, I think your conclusion is a bit off. You’re assuming 1 kW/m^2, which is only true with the sun directly overhead. A real-world solar setup gets hit with several factors of cosine (related to roof pitch, time of day, day of year, and latitude) that conspire to reduce the total output.

For example, my 50 sq m setup, at -29 deg latitude, generated your estimated 30 kWh/day output. My panels are ~20% efficient, which suggests that at 60% efficiency the average household would only meet around half of its energy needs with 10 sq m.

Yes, solar has the potential to drastically reduce energy costs, but even with free energy storage, individual households aren’t likely to achieve self-sufficiency.
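The scaling in this reply can be checked with a minimal sketch, assuming output is proportional to panel area times efficiency at the same site (so tilt, latitude, and weather losses carry over unchanged). The function name is hypothetical; the inputs are the commenter's. It gives about 18 kWh/day, or roughly 60% of a 30 kWh/day need, in line with "around half".

```python
def scaled_daily_kwh(measured_kwh: float, measured_area_m2: float, measured_eff: float,
                     new_area_m2: float, new_eff: float) -> float:
    """Scale a measured daily output to a different panel area and efficiency.

    Assumes output is proportional to area * efficiency at the same site.
    """
    return measured_kwh * (new_area_m2 / measured_area_m2) * (new_eff / measured_eff)


# Figures from the comment: 50 m^2 of ~20% panels -> ~30 kWh/day at -29 deg latitude.
estimate = scaled_daily_kwh(30, 50, 0.20, 10, 0.60)
print(f"{estimate:.0f} kWh/day, {estimate / 30:.0%} of a 30 kWh/day need")
```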

sahmeepee 541 days ago

Average US home.

In Europe it is around 6-7 kWh/day. This might increase with electrification of heating and transport, but probably nothing like as much as the energy consumption they are replacing (due to greater efficiency of the devices consuming the energy and other factors like the quality of home insulation.)

In the rest of the world the average home uses significantly less.

barney54 541 days ago

But the cost of electricity is not falling—it’s increasing. Wholesale prices have decreased, but retail rates are up. In the U.S. rates are up 27% over the past 4 years. In Europe prices are up too.

NoLinkToMe 541 days ago

That's a bit of a non-statement. Virtually all prices increase because of money supply, but we consider things to get cheaper if their prices grow less fast than inflation / income.

General inflation has outpaced the inflation of electricity prices by about 3x in the past 100 years. In other words, electricity has gotten cheaper over time in purchasing power terms.

And that's whilst our electricity usage has gone up by 10x in the last 100 years.

And this concerns retail prices, which includes distribution/transmission fees. These have gone up a lot as you get complications on the grid, some of which is built on a century old design. But wholesale prices (the cost of generating electricity without transmission/distribution) are getting dirt cheap, and for big AI datacentres I'm pretty sure they'll hook up to their own dedicated electricity generation at wholesale prices, off the grid, in the coming decades.
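A minimal sketch of the two conversions this exchange relies on: turning a cumulative rise (the 27% over 4 years above) into a yearly rate, and removing general inflation to get a real-terms change. Function names are hypothetical, and the 20% inflation figure in the usage line is a placeholder to show the mechanics, not a measured CPI value.

```python
def annualized(total_change: float, years: float) -> float:
    """Compound annual rate equivalent to a cumulative change (0.27 means +27%)."""
    return (1.0 + total_change) ** (1.0 / years) - 1.0


def real_change(nominal_change: float, inflation: float) -> float:
    """Price change after removing general inflation over the same period."""
    return (1.0 + nominal_change) / (1.0 + inflation) - 1.0


if __name__ == "__main__":
    print(f"+27% over 4 years = {annualized(0.27, 4):.1%} per year")
    placeholder_inflation = 0.20  # hypothetical; substitute the actual CPI change
    print(f"real change: {real_change(0.27, placeholder_inflation):+.1%}")
```

A price can rise in nominal terms yet fall in real terms whenever `nominal_change` is below `inflation`, which is the distinction the reply above is drawing.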

xbmcuser 541 days ago

Most large compute clusters would be buying electricity at wholesale prices, not retail. Anyway, solar and battery prices only reached the tipping point this year; from now on, the longer power companies keep retail prices high, the more people will defect from the grid and install their own solar + batteries.

lucubratory 541 days ago

I am not certain because I've been very focused on the o3 news, but at least yesterday neither the US nor Europe were part of China.

lxgr 541 days ago

But data centers pay wholesale prices or even less (given that AI training and, to a lesser extent, inference clusters in particular can shed load like few other consumers of electricity).

fulafel 541 days ago

And this is great news as long as the marginal production of electricity (the most expensive to produce, first to turn on/off according to demand) is fossil fuels.

necovek 541 days ago

If climate change ends up changing weather profiles and we start seeing many more cloudy days or dust/mist in the air, we'll need to push those solar panels higher (all the way to space?) or have many more of them, figure out transmission to the ground, and costs will very much balloon.

Not saying this will happen, but it's risky to rely on solar as the only long-term solution.

nateglims 541 days ago

Is it going to fall significantly for data centers? Industrial policy for consumer power is different from subsidizing it for data centers and if you own grid infrastructure why would you tank the price by putting up massive amounts of capital?

xbmcuser 541 days ago

It's the same as using the cloud versus your own infrastructure: there will be a point where building your own solar and battery plant is cheaper than what they are charging. They will need to follow the price decline if they want to keep customers; if not, there will be grid defections at mass scale.

nateglims 541 days ago

I don’t think this reflects the reality of the power industry. Data centers are the only significant growth in actual generated power in decades and hyperscalers are already looking at very bespoke solutions.

The heavy commodification of networking and compute brought about by the internet and cloud aligned with tech company interests in delivering services or content to consumers. There does not seem to be an emerging consensus that data center operators also need to provide consumer power.

xbmcuser 541 days ago

It was not the reality of the power industry, but it will be soon: we have never had a source of electricity that is the cheapest, keeps getting cheaper, and is easy to install. This is something unique.

I don't see Google, Amazon, Microsoft or any company paying $10 for something if building it themselves will cost them $5. Either the price difference will reach a point where investing in their own power production makes sense, or the power companies will decrease prices. Look at how all three have already been investing in power production themselves over the last decade, either to get better prices or for PR.

iandanforth 541 days ago

Let's say that Google is already 1 generation ahead of nvidia in terms of efficient AI compute. ($1700)

Then let's say that OpenAI brute-forced this without any meta-optimization of the hypothesized search component (they just set a compute budget). This is probably low-hanging fruit, worth another 2x reduction in compute. ($850)

Then let's say that OpenAI was pushing really, really hard for the numbers, was willing to burn cash, and so didn't bother with serious thought about hardware-aware distributed inference. This could be more than a 2x decrease in cost (we've seen better attention mechanisms deliver 10x reductions in cost), but let's go with 2x for now. ($425).

So I think we've got about an 8x reduction in cost sitting there once Google steps up. This is probably 4-6 months of flat-out work if they haven't already started down this path, but with what they've got with Deep Research, maybe it's sooner?

Then if "all" we get is hardware improvements we're down to what 10-14 years?
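The chain of halvings above, worked through as a minimal sketch. Each 2x factor is the commenter's guess, not a measurement, and the remaining-gap step assumes the top comment's ~10^3 cost gap and its 2 to 2.5-year doubling time. Under those assumptions an 8x head start leaves about 125x, or roughly 14 to 17 years of hardware-only gains, so landing at 10 to 14 years would also need faster hardware improvement.

```python
import math

start_cost = 3400.0  # $/task for o3 high-compute, per the top comment
# Reductions guessed in the comment above (illustrative, not measured).
reductions = {
    "TPU-generation hardware lead": 2.0,
    "tuned search / compute budget": 2.0,
    "hardware-aware distributed inference": 2.0,
}

cost = start_cost
for name, factor in reductions.items():
    cost /= factor
    print(f"after {name}: ${cost:,.0f} per task")

total = math.prod(reductions.values())  # 2 * 2 * 2 = 8x
remaining_gap = 1e3 / total             # top comment's ~10^3 gap to human cost
for doubling_years in (2.0, 2.5):
    years = math.log2(remaining_gap) * doubling_years
    print(f"remaining {remaining_gap:.0f}x at {doubling_years} y/doubling: {years:.1f} years")
```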

qingcharles 541 days ago

Until 2022 most AI research was aimed at improving the quality of the output, not the quantity.

Since then there has been a tsunami of optimizations in the way training and inference is done. I don't think we've even begun to find all the ways that inference can be further optimized at both hardware and software levels.

Look at the huge models that you can happily run on an M3 Mac. The cost reduction in inference is going to vastly outpace Moore's law, even as chip design continues on its own path.

promptdaddy 541 days ago

*DeepMind research?

iandanforth 541 days ago

Nope, Gemini Advanced with Deep Research. New mode of operation that does more "thinking" and web searches to answer your question.

cchance 541 days ago

I mean, considering that the big breakthrough this year for o1/o3 seems to have been "models having internal thoughts might help reasoning", it was sort of a "duh" moment for everyone outside of the AI field.

I'd hope we see more internal optimizations and improvements to the models. The idea that the big breakthrough was "don't spit out the first thought that pops into your head" seems obvious to everyone outside of the field, but guess what: it turned out to be a big improvement when the devs decided to add it.

versteegen 541 days ago

> seems obvious to everyone outside of the field

It's obvious to people inside the field too.

Honestly, these things seem to be less obvious to people outside the field. I've heard so many uninformed takes about LLMs not representing real progress towards intelligence (even here on HN of all places; I don't know why I torture myself reading them), that they're just dumb memorizers. No, they are an incredible breakthrough, because extending them with things like internal thoughts will so obviously lead to results such as o3, and far beyond. Maybe a few more people will start to understand the trajectory we're on.

0points 540 days ago

> No, they are an incredible breakthrough, because extending them with things like internal thoughts will so obviously lead to results such as o3, and far beyond.

While I agree that the LLM progress as of late is interesting, the rest of your sentiment sounds more like you are in a cult.

As long as your field keep coming with less and less realistic predictions and fail to deliver over and over, eventually even the most gullible will lose faith in you.

Because that's what this all is right now. Faith.

> Maybe a few more people will start to understand the trajectory we're on.

All you are saying is that you believe something will happen in the future.

We can't have an intelligent discussion under those premises.

It's depressing to see so many otherwise smart people fall for their own hype train. You are only helping rich people get more rich by spreading their lies.

versteegen 540 days ago

I know I'm at fault for emotively complaining about "uninformed takes" in my comment instead of being substantive, which I regret, and I deserve replies such as this. I'll try harder to avoid getting into these arguments next time.

I wouldn't be an AI researcher if I didn't have "faith" that AI as a goal is worthwhile and achievable and I can make progress. You think this is irrational?

I am actually working to improve the SoTA in mathematical reasoning. I have documents full of concrete ideas for how to do that. So does everyone else in AI, in their niche. We are in an era of low hanging fruit enabled by ML breakthroughs such as large-scale transformers. I'm not someone who thinks you can simply keep scaling up transformers to solve AI. But consider System 1 and System 2 thinking: System 1 sure looks solved right now.

> As long as your field keep coming with less and less realistic predictions and fail to deliver over and over

I don't think we're commenting on the same article here. For example, FrontierMath was expected to be near impossible for LLMs for years, now here we are 5 weeks later at 25%.

Agentus 541 days ago

a trickle of people sure, but most people never stumble upon good evaluation skills by accident, let alone reason themselves to that level, so i don't see how most people will have even a semblance of an idea of a realistic trajectory of ai progress. i think most people have very little conceptualization of their own thinking/cognitive patterns, at least not enough to sensibly extrapolate it onto ai.

doesn't help that most people are just mimics when talking about stuff that's outside their expertise.

Hell, my cousin, a quality-college-educated individual with high social/emotional IQ, will go down the conspiracy theory rabbit hole so quickly based on some baseless crap printed on the internet. Then he’ll talk about people being Satan worshipers.

versteegen 541 days ago

You're being pretty harsh, but:

> i think most people have very little conceptualization of their own thinking/cognitive patterns, at least not enough to sensibly extrapolate it onto ai.

Quite true. If you spend a lot of time reading and thinking about the workings of the mind you lose sight of how alien it is to intuition. While in high school I first read, in New Scientist, the theory that conscious thought lags behind the underlying subconscious processing in the brain. I was shocked that New Scientist would print something so unbelievable. Yet there seemed to be an element of truth to it, so I kept thinking about it and slowly changed my assessment.

Agentus 540 days ago

sorry, humans are stupid and what intelligence they have is largely impotent. if this weren't the case, life wouldn't be such a dystopia. my crassness doesn't come from trying to pick on a particular group of humans; it's just disappointment in recognizing the efficacy of human intelligence and its ability to turn reality into a better reality (meh).

yeah i was just thinking how a lot of thoughts that i thought were my original thoughts were really made possible by communal thoughts. like, i can maybe have some original frontier thoughts that involve averages, but that's only possible because some other person invented the abstraction of averages, which was then collectively disseminated to everyone through education, not to mention all the subconscious processes that are necessary for me to will certain thoughts into existence. makes me reflect on how much cognition is really mine vs. (not mine) an inevitable product of a deterministic process and a product of other humans.

sfjailbird 541 days ago

Sounds like your cousin is able to think for himself. The amount of bullshit I hear from quality-college educated individuals, who simply repeat outdated knowledge that is in their college curriculum, is no less disappointing.

daveguy 540 days ago

Buying whatever bullshit you see on the internet to such a degree that you're re-enacting satanic panic from the 80s is not "thinking for yourself". It's being gullible about areas outside your expertise.

dogma1138 541 days ago

Reflection isn’t a new concept, but a) actually proving that it’s an effective tool for these types of models and b) finding an effective method for reflection that doesn’t just lock you into circular “thinking” were the hard parts, and hence the “breakthrough”.

It’s very easy to say hey ofc it’s obvious but there is nothing obvious about it because you are anthropomorphizing these models and then using that bias after the fact as a proof of your conjecture.

This isn’t how real progress is achieved.

beardedwizard 541 days ago

Calling it reflection is, for me, further anthropomorphizing. However I am in violent agreement that a common feature of llm debate is centered around anthropomorphism leading to claims of "thinking longer" or "reflecting" when none of those things are happening.

The state of the art seems very focused on promoting the idea that language that might encode reasoning is as good as actual reasoning, rather than asking what a reasoning model might look like.

dogma1138 539 days ago

I didn’t name it. To me it’s more about reflecting the output back on itself, which doesn’t necessarily mean anthropomorphism.

acchow 541 days ago

> ~doubling every 2-2.5 years) puts us at 20~25 years.

The trend for power consumption of compute (Megaflops per watt) has generally tracked with Koomey’s law for a doubling every 1.57 years

Then you also have model performance improving with compression. For example, Llama 3.1 8B outperforms the original Llama 65B.
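Doubling times are easier to compare as yearly rates. A minimal sketch, with a hypothetical helper name, converting the 1.57-year Koomey's-law doubling and the 2 to 2.5-year figure from the top comment into annual improvement rates and years needed for a 10^3 gap. Koomey's law tracks computations per unit of energy, which is not the same thing as cost per computation, so treating the two as interchangeable is an assumption.

```python
import math


def annual_improvement(doubling_years: float) -> float:
    """Yearly improvement factor implied by a doubling every `doubling_years`."""
    return 2.0 ** (1.0 / doubling_years)


for label, d in [("Koomey's law", 1.57), ("top comment, fast", 2.0), ("top comment, slow", 2.5)]:
    rate = annual_improvement(d) - 1.0
    years_1000x = math.log2(1e3) * d  # doublings needed for 10^3, times the doubling time
    print(f"{label}: +{rate:.0%} per year, {years_1000x:.1f} years for 10^3")
```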

0points 540 days ago

Then you will just have the issue of supplying enough power to support this "linear" growth of yours.

agumonkey 541 days ago

Who in this field is anticipating the impact of near-AGI on society? Maybe I'm too anxious, but not planning for a potentially workless life seems dangerous (but maybe I'm just not following the right groups).

daveguy 540 days ago

AGI would have a major impact on human work. Currently the hype is much greater than the reality. But it looks like we are starting to see some of the components of an AGI and that is cause for discussion of impact, but not panicked discussion. Even the chatbot customer service has to be trained on the domain. Still it is most useful in a few specific ways:

Routing to the correct human support

Providing FAQ level responses to the most common problems.

Providing a second opinion to the human taking the call.

So, even this most relevant domain for the technology doesn't eliminate human employment (because it's just not flexible or reliable enough yet).

m3kw9 541 days ago

Don’t forget that humans, who are the real GI, paired with increasingly capable AI can create a feedback loop that accelerates new advances.

bjornsing 541 days ago

> are we stuck waiting for the 20-25 years for GPU improvements

If this turns out to be hard to optimize / improve then there will be a huge economic incentive for efficient ASICs. No freaking way we’ll be running on GPUs for 20-25 years, or even 2.

coolspot 541 days ago

LLMs need efficient matrix multipliers. GPUs are specialized ASICs for massive matrix multiplication.

vlovich123 541 days ago

LLMs get to maybe ~20% of the rated max FLOPS for a GPU. It’s not hard to imagine that a purpose built ASIC with maybe adjusted software stack gets us significantly more real performance.

boroboro4 541 days ago

They get more than this. For prefill we can get 70% matmul utilization; for generation it's less than this, but we’ll get to >50% there too eventually.
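One common way to put a number on "percent of rated FLOPS" is model FLOPs utilization (MFU). A minimal sketch, assuming a dense transformer and the usual approximation of ~2 FLOPs per parameter per token for a forward pass (attention FLOPs ignored); the throughput, model size, and peak figures in the usage line are hypothetical, not measurements of any real GPU. Prefill processes many tokens per weight load, which is why it tends to reach higher utilization than token-by-token generation.

```python
def model_flops_utilization(tokens_per_s: float, n_params: float, peak_flops: float) -> float:
    """Rough MFU for a dense transformer forward pass.

    Uses the ~2 * n_params FLOPs-per-token approximation (one multiply and
    one add per weight) and ignores attention FLOPs.
    """
    achieved_flops = tokens_per_s * 2.0 * n_params
    return achieved_flops / peak_flops


# Hypothetical: a 7B-parameter model at 20,000 tokens/s on an accelerator rated at 1e15 FLOP/s.
print(f"MFU: {model_flops_utilization(20_000, 7e9, 1e15):.0%}")
```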

bjornsing 540 days ago

And even when you get to 100% utilization you’ll still be wasting a crazy amount of gates / die area, plus you’re paying the Nvidia tax. There is no way in hell that will go on for 10 years if we have good AGI but inference is too expensive.

noFaceDiscoG668 540 days ago

Maybe another plane with a bunch of semiconductor people will disappear over Kazakhstan or something. Capitalist communism gets bossier in stealth mode.

But sorry, blablabla, this shit is getting embarrassing.

> The question is now, can we close this "to human" gap

You won’t close this gap by throwing more compute at it. Anything in the sphere of creative thinking eludes most people in the history of the planet. People with PhDs in STEM end up working in IT sales not because they are good or capable of learning but because more than half of them can’t do squat shit, despite all that compute and all those algorithms in their brains.

spencerchubb 541 days ago

> Super exciting that OpenAI pushed the compute out this far

it's even more exciting than that. the fact that you even can use more compute to get more intelligence is a breakthrough. if they spent even more on inference, would they get even better scores on arc agi?

lolinder 541 days ago

> the fact that you even can use more compute to get more intelligence is a breakthrough.

I'm not so sure: what they're doing by just throwing more tokens at it is similar to "solving" the traveling salesman problem by throwing tons of compute into a breadth-first search. Sure, you can get better and better answers the more compute you throw at it (with diminishing returns), but is that really surprising to anyone who's been following tree-of-thought models?

All it really seems to tell us is that the type of model that OpenAI has available is capable of solving many of the types of problems that ARC-AGI-PUB has set up given enough compute time. It says nothing about "intelligence" as the concept exists in most people's heads—it just means that a certain very artificial (and intentionally easy for humans) class of problem that wasn't computable is now computable if you're willing to pay an enormous sum to do it. A breakthrough of sorts, sure, but not a surprising one given what we've seen already.

mithametacs 540 days ago

An algorithm designed for translating between human languages has now been shown to generalize to solving visual IQ test puzzles, without much modification.

Yes, I find that surprising.

echelon 541 days ago

Maybe it's not linear spend.

matusp 541 days ago

I don't think this is only about efficiency. The model I have here is that this is similar to when we beat chess. Yes, it is impressive that we made progress on a class of problems, but is this class aligned with what the economy or the society needs?

Simple turn-based games such as chess turned out to be too far away from anything practical and chess-engine-like programs were never that useful. It is entirely possible that this will end up in a similar situation. ARC-like pattern matching problems or programming challenges are indeed a respectable challenge for AI, but do we need a program that is able to solve them? How often does something like that come up really? I can see some time-saving in using AI vs StackOverflow in solving some programming challenges, but is there more to this?

edanm 541 days ago

I mostly agree with your analysis, but just to drive home a point here - I don't think that algorithms to beat Chess were ever seriously considered as something that would be relevant outside of the context of Chess itself. And obviously, within the world of Chess, they are major breakthroughs.

In this case there is more reason to think these things are relevant outside of the direct context - these tests were specifically designed to see if AI can do general-thinking tasks. The benchmarks might be bad, but that's at least their purpose (unlike in Chess).

lugu 540 days ago

ARC is designed to be hard for current models. It cannot be a proxy for how useful they are; it says something else. Most likely those models won't replace humans at their tasks in their organizations. Instead, "we" will design pipelines so that the tasks align with the abilities of the model, and we will put humans at the periphery. Think of how a factory is organised for robots.

spamlettuce 540 days ago

Okay, but what about SWE-bench? o3 scored 75% on that eval.

daxfohl 541 days ago

I wonder if we'll start seeing a shift in compute spend, moving away from training time, and toward inference time instead. As we get closer to AGI, we probably reach some limit in terms of how smart the thing can get just training on existing docs or data or whatever. At some point it knows everything it'll ever know, no matter how much training compute you throw at it.

To move beyond that, the thing has to start thinking for itself, some auto feedback loop, training itself on its own thoughts. Interestingly, this could plausibly be vastly more efficient than training on external data because it's a much tighter feedback loop and a smaller dataset. So it's possible that "nearly AGI" leads to ASI pretty quickly and efficiently.

Of course it's also possible that the feedback loop, while efficient as a computation process, isn't efficient as a learning / reasoning / learning-how-to-reason process, and the thing, while as intelligent as a human, still barely competes with a worm in true reasoning ability.

Interesting times.

freehorse 541 days ago

> I am interpreting this result as human level reasoning now costs (approximately) 41k/hr to 2.5M/hr with current compute.

On a very simple toy task, which ARC-AGI basically is. ARC-AGI tests are not hard per se; LLMs just find them hard. We do not know how this scales to more complex, real-world tasks.

SamPatt 541 days ago

Right. Arc is meant to test the ability of a model to generalize. It's neat to see it succeed, but it's not yet a guarantee that it can generalize when given other tasks.

The other benchmarks are a good indication though.

lyu07282 540 days ago

> Arc is meant to test the ability of a model to generalize. It's neat to see it succeed, but it's not yet a guarantee that it can generalize when given other tasks.

Well no, that would mean that Arc isn't actually testing the ability of a model to generalize then and we would need a better test. Considering it's by François Chollet, yep we need a better test.

criddell 541 days ago

Does it mean anything for more general tasks like driving a car?

brookst 541 days ago

Is every smart person a good driver?

earth2mars 541 days ago

That kind of proves the point that no matter how smart it gets, it may still have several disabilities in areas that are crucial and very basic for humans. Is it generalizing to any task, or only to a specific set of tasks?

zarzavat 541 days ago

Likely yes. Every smart person is capable of being a good driver, so long as you give them enough training and incentive. Zero smart people are born being able to drive.

brookst 540 days ago

What about the archetype of the absent-minded genius? I’ve met several people who are shockingly intelligent but completely lose situational awareness on a regular basis.

And conversely, the world’s best drivers aren’t noted for being intellectual giants.

I don’t think driving skill and raw intelligence are that closely connected.

fragmede 541 days ago

There are different kinds of smarts, and not every smart person is good at all of them. Specifically, spatial reasoning is important for driving, and if a smart person is good at all kinds of thinking except that one, they're going to find it challenging to be a good driver.

riku_iki 541 days ago

> ~=$3400 per single task

The report says it is $17 per task, and $6k for the whole dataset of 400 tasks.

binarymax 541 days ago

"Note: OpenAI has requested that we not publish the high-compute costs. The amount of compute was roughly 172x the low-compute configuration."

The low compute was $17 per task. Extrapolating, 172*$17 for the high compute would be $2,924 per task, so I am also confused by the $3400 number.

bluecoconut 541 days ago

3400 came from counting pixels on the plot.

Also, it's $20 per task for o3-low in the table for the semi-private set, which x172 is 3440, also coming in close to the 3400 number.
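Both estimates in this subthread come from the same multiplication: the blog's note that the high-compute run used roughly 172x the low-compute configuration, applied to a published low-compute per-task cost ($17 from the 400-task row, $20 from the semi-private row). The sketch below assumes cost scales linearly with compute, which neither OpenAI nor ARC Prize has confirmed; the dictionary keys are just labels.

```python
# Sketch: estimate the unpublished high-compute per-task cost by scaling the
# published low-compute ("high efficiency") cost.
# Assumption: cost scales linearly with compute. This is NOT confirmed.

COMPUTE_MULTIPLIER = 172  # "roughly 172x the low-compute configuration"

# Low-compute per-task costs as quoted in this thread (USD).
low_compute_cost_per_task = {
    "400-task dataset": 17,
    "semi-private": 20,
}

estimates = {
    split: cost * COMPUTE_MULTIPLIER
    for split, cost in low_compute_cost_per_task.items()
}

for split, estimate in estimates.items():
    cost = low_compute_cost_per_task[split]
    print(f"{split}: ${cost} x {COMPUTE_MULTIPLIER} = ${estimate:,} per task")
```

This gives $2,924 and $3,440 per task respectively, which brackets the ~$3.4k read off the plot.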

bluecoconut 541 days ago

That's the low-compute mode. In the plot at the top where they score 88%, O3 High (tuned) is ~$3.4k per task.

HDThoreaun 541 days ago

The low compute one did as well as the average person though

ionwake 541 days ago

Sorry to be a noob, but can someone tell me: does this mean o3 will be unaffordable for a typical user? Will only companies with thousands to spend per query be able to use this?

Sorry for being thick, I'm just confused about how they can turn this into an affordable service.

JohnnyMarcone 541 days ago

There are likely many efficiency gains that will be made before it's released, and after. Also, they showed o3-mini to be better than o1 at lower cost on multiple benchmarks, so there are already improvements at a lower cost than what's currently available.

ionwake 541 days ago

Great thank you

xrendan 541 days ago

You're misreading it: there are two different runs, a low-compute and a high-compute run.

The number for the high-compute one is ~172x the first one according to the article, so ~=$2900.

Thorrez 541 days ago

What's extra confusing is that in the graph the runs are called low compute and high compute, while in the table they're called high efficiency and low efficiency. So the high and low got swapped.

jhrmnn 541 days ago

That’s for the low-compute configuration that doesn’t reach human-level performance (not far though)

riku_iki 541 days ago

I was referring to the high-compute mode. They have a table with a breakdown here: https://arcprize.org/blog/oai-o3-pub-breakthrough

junipertea 541 days ago

The table row with the 6k figure refers to high efficiency, not high-compute mode. From the blog post:

Note: OpenAI has requested that we not publish the high-compute costs. The amount of compute was roughly 172x the low-compute configuration.

gbnwl 541 days ago

That's "efficiency" high, which actually means less compute. The 87.5% score using low efficiency (more compute) doesn't have cost listed.

bluecoconut 541 days ago

they use some poor language.

"High Efficiency" is O3 Low; "Low Efficiency" is O3 High.

They left the "Low efficiency" (O3 High) values as `-` but you can infer them from the plot at the top.

Note the $20 and $17 per task aligns with the X-axis of the O3-low

EVa5I7bHFq9mnYK 541 days ago

That's high EFFICIENCY. High efficiency = low compute.

cle 541 days ago

Efficiency has always been the key.

Fundamentally it's a search through some enormous state space. Advancements are "tricks" that let us find useful subsets more efficiently.

Zooming way out, we have a bunch of social tricks, hardware tricks, and algorithmic tricks that have resulted in a super useful subset. It's not the subset that we want though, so the hunt continues.

Hopefully it doesn't require revising too much in the hardware & social bag of tricks; those are a lot more painful to revisit...

Macuyiko 540 days ago

I am not so sure, but indeed it is perhaps also a sad realization.

You compare this to "a human" but also admit there is a high variation.

And I would say there are a lot of humans being paid ~=$3400 per month. Not for a single task, true, but honestly for no value-creating task at all. Just for their time.

So what if we think in terms of output rather than time?

madduci 541 days ago

Let's see when this will be released to the free tier. Looks promising, although I hope they will also be able to publish more details on this, as part of the "open" in their name.

ein0p 540 days ago

This is a beta version. By the time they're done with this it'll be measured in single-digit dollars, if not cents.

chefandy 541 days ago

I think the real key is figuring out how to turn the hand-wavy promises of this making everything better into policy long fucking before we kick the door open. It’s self-evident that this being efficient and useful would be a technological revolution; what’s not self-evident is that it wouldn’t benefit the large corporate entities that control it even more disproportionately than it does now, to the detriment of many other people.