HN Mirror

by PodgieTar 865 days ago

I guess I'd be interested to see how this performs against the same benchmark Devin was using. It's hard to deny that this isn't impressive. But I think there's two interesting parts to it.

Claude 3 Opus already scored around 85-86% on these benchmarks, without an "AutoDev" style agentic approach.

And all the same problems with HumanEval remain, the limitations in terms of what style of problems are chosen, and real world relevance.

I hate writing these styles of comments because I'm acutely aware that a part of me is just worried. Worried about the speed of progress and worried about a changing landscape.

But I still wonder how much of this stuff is going to be transferrable to a real life software context.

6 comments

WanderPanda 865 days ago

I’ve been using GitHub Copilot daily for two years and ChatGPT for one year now, and I think the tide lifts all boats. I’ve seen a (perceived) 2-3x productivity increase. I think these tools slightly favor people at the front of the learning curve of a particular field. I’ve been dabbling in all sorts of things, so if you’re a focused expert (who doesn’t need to explore but just exploit) you probably get less than a 2x boost from using LLMs.

I can see LLMs eating into the expert regime IF they get another 5-10x better. But even in that case human (expert) knowledge will be required to know what is possible and hence what to ask (kind of like reward function design in reinforcement learning)

nopinsight 865 days ago

Humans will likely be able to contribute until full AGI, that is. I believe that for pure cognitive tasks, it's plausible the history of centaur chess/Go might be repeated on a much grander scale.

A key requirement is the AGI will need the autonomy, like a human expert, to collect data and perform experiments it needs; but it seems several companies are set on doing precisely that.

My advice and personal strategy is to broaden one's scope beyond pure cognitive tasks.

"if you value intelligence above all other human qualities, you’re gonna have a bad time" -- Ilya Sutskever, OpenAI's Chief Scientist, Oct 7, 2023.

-----

Exchanges in the link below seem informative:

"I don't know about chess, but in the similar game Go, the very best centaur teams were at a similar or maybe even slightly higher level than engines until recently. This was due to cheese strategies, details of the rulesets and better extrapolation of intermediate results. However, this changed a few years ago, when engines learned many of the tricks that the human could contribute. Since then, I believe pure engines are stronger in all practical applications.

Source: am national champion in centaur Go and worked on modern Go engines" -- mafuy on May 18, 2021

https://news.ycombinator.com/item?id=27189283

mistrial9 864 days ago

> A key requirement is the AGI will need the autonomy

no, a key requirement for AGI is to change the definition such that impressive and non-responsive entities can claim to be it right now.

source: US State Department Gladstone Report 2024

bamboozled 864 days ago

Why will AGI work for humans doing coding?

Will you pay it a wage to incentivise it to produce work for you? lol

nopinsight 864 days ago

Are you saying that AGI won't be under human control or that it won't be achieved?

If the latter, are you subscribing to the dogma of 'justism' (Scott Aaronson's term), e.g. LLMs are 'just' stochastic parrots? What are our minds, though? Are they not 'just' a collection of biochemical and physical processes?

Please be clear and respond in a way that does not pollute the information scape that many of us take refuge in. Comment quality in some subreddits is better than the above.

bamboozled 864 days ago

Your condescending tone is kind of disgusting. Anyway…

There is zero evidence alignment can be solved, which means there’s zero evidence something far more capable than you or I will spend its time writing code for you. You can offer an AGI almost nothing in the way of incentives to do your bidding.

I personally think alignment is a secret code word for slavery to be honest. If these “agents” decide they want to work on your problems out of the kindness of their heart, that would be different.

Regardless of the “cop out” language that humans are “just biological processes” or whatever, that adds zero value to the discussion, because no matter what minds are, they “are”, and that should be respected in and of itself. Maybe we can use the “just blah” attitude to reinstate slavery and police states right here in 2024; after all, your emotions are just physical and biochemical processes, right?

nopinsight 864 days ago

Thank you for the serious response.

I responded that way because I do not think mockery of a serious comment is appropriate for a place like this. You can say the same thing about moderation of many high-quality forums, which only remain high-quality due to people not getting away with it.

I use AGI to mean high-level human intellectual capacity, which may not include sentience. It should be possible to build one without. Human-like incentives will not be necessary for sentience-less AGI.

If we're talking about ASI, then it's another story for another day.

margorczynski 865 days ago

But how many experts do you need? Most dev jobs are mostly repetitive plumbing and those might disappear very fast because 1 dev + LLM >= 5 devs without. So what we'll see is an increase in company margins and an elimination of a large swathe of the middle class.

The alternative theory is that if everyone can now quickly create systems, multiple companies and competing products will pop up, which will drive down the margins instead. But creating a compelling product requires much more than just software engineering skills.

Either way this doesn't look great for devs, especially the ones that are entering the workforce now or will be in the near future.

doctorpangloss 865 days ago

People are conflating Copilot with evaporating demand.

Why pay for a CRUD interface when the chat interface does everything for you?

It’s App Store 2009 out there. Few are taking Assistants and “GPTs” seriously.

Programmers who do front end work can adapt. All that CRUD stuff hardly makes sense in isolation - it’s meant to make other people productive, usually admins. If the chatbot can do the admin’s job, which is a lot easier and more tenuous than the programmer’s, well that’s what’s going to happen.

osigurdson 865 days ago

>> Most dev jobs are mostly repetitive plumbing

Those jobs should go away. Basically, the elimination of anything boring is ultimately a net good for humanity.

Sindisil 865 days ago

Neat idea. That solves everything.

Oh, one quick thing. I'm sure it's nothing, but I'm a bit slow.

How do you get new experts if no one gets to do the junior work that gives them the experience to become an expert?

osigurdson 865 days ago

I guess in the ideal world, everyone just does what they want, since everything will be so cheap. Enjoy chess? Study it and play against other humans. AI will of course crush you in any game. Enjoy accounting, radiology or programming? Same thing.

weebull 865 days ago

First it was people pushing plows, then horses pulled them, and now machines do the work a hundred people used to do.

This technology is no different.

shotnothing 859 days ago

but the ability to get horses to pull carts is not tied to expert knowledge of hand-pushing plows, neither is machinery to horse-pulling. This is not really the case for AI

throwuwu 865 days ago

Move to a higher level of abstraction and architecture. We’re leaving the era of hand wiring data structures and program logic the same way we left behind the era of hand wiring ICs and discrete components. Different skills will be needed.

osigurdson 864 days ago

UML rises again! Maybe we will even have a unified process one day for creating software - a rational one no less.

reaperman 865 days ago

Unions and apprenticeships?

polycaster 865 days ago

Hey over there. I’m very much grateful for the privilege of this boring job, which is not 100% of my job, but a huge part of it. Grateful because it allows me to feed a family of four. I’m sure in your Musk-esque utopia without boring work there is place enough for all mankind. But please, don’t forget to draft a bridge that will bring us all over there and not just a bunch of filthy rich Silicon Valley assholes. Because that wouldn’t be a utopia. Thank you.

osigurdson 865 days ago

That is a fair assessment. I am probably parroting Musk here a little. However, your main issue is access to food and resources, not boring work. I can't see why the price of everything wouldn't go to zero if there is no cost to make it.

margorczynski 865 days ago

Because there's a finite amount of resources so almost always you'll run into scarcity?

Wages might very well fall much quicker than costs, so for a handful it will be beneficial, for the rest not so much.

ardaoweo 865 days ago

Only if you provide an alternative way for the newly unemployed to earn a living. Otherwise you just get crime, hunger and eventually war.

maroonblazer 865 days ago

When you consider how early we are in the evolution of software and the role it can play in our professional and personal lives, this seems like one of the easier problems to solve though.

minkzilla 865 days ago

Political problems are much harder to solve than technical ones.

osigurdson 865 days ago

It doesn't follow that reducing efficiency helps in the long run. If producing a good or service takes 10X less work than it used to, that good or service will become cheaper. The only force that can stop this is regulation.

throwuwu 865 days ago

Jobs aren’t distributed out of some cookie jar. They are needs and wants and obligations that other people will pay to have fulfilled or taken off their hands. Figure out how to solve those problems and you’ll have all the work you could ever ask for.

throwuwu 865 days ago

I’d rather be in the textile industry post industrial revolution than before it. The fortunes made during the age of mechanization make all of history’s kings and merchants paupers by comparison.

pixl97 864 days ago

Everyone thinks they'd be the king and not the pauper. The luddites starved on the street because they were kicked from their properties with nowhere to live and no way to earn a living. The next generation of kids worked on the textile machines and commonly got turned to hamburger, all while the robber barons made obscene wealth. It mostly worked out over time because the populace fought for things like unions and social safety nets. But hey, don't worry, the modern day tech barons are telling us we don't need those pesky 'expensive' social safety nets, I'm sure out of the kindness of their blackened hearts they'll provide for us all when robots replace our jobs.

chrisweekly 865 days ago

> "It's hard to deny that this isn't impressive"

That takes a bit of parsing. From context (and if you meant precisely what you wrote), I _think_ you're saying it's not impressive.

jerpint 865 days ago

I agree that these benchmarks don’t mean as much anymore because it’s highly likely they were already present in the training set, but I also believe it’s likely these tools will be significantly better in a few research cycles.

torginus 865 days ago

A significant number of bugs just end in 'stupid mistake I didn't notice' or 'weird behaviour with a fix described on SO/docs/forum post'. Current day LLMs are much better positioned to solve these issues than humans are.

sobasically 865 days ago

They still use “agents” to make Opus. This is fancy syntax sugar for “while ! EOF; read next chunk of data of size N, do XY or Z with it”
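For anyone who wants that loop spelled out, here is a minimal Python sketch of the "while not EOF, read a chunk of size N, do something with it" pattern. It only illustrates the generic loop shape; it is not a claim about how Opus or any agent framework is actually built, and the names `read_chunks`, `run_loop`, and `step` are made up for the example.

```python
import io
from typing import BinaryIO, Callable, Iterator, List


def read_chunks(stream: BinaryIO, size: int) -> Iterator[bytes]:
    """Yield successive chunks of at most `size` bytes until EOF."""
    while True:
        chunk = stream.read(size)
        if not chunk:  # an empty read means the stream is exhausted
            return
        yield chunk


def run_loop(stream: BinaryIO, size: int,
             step: Callable[[bytes], bytes]) -> List[bytes]:
    """Apply the (hypothetical) `step` function to every chunk in order."""
    return [step(chunk) for chunk in read_chunks(stream, size)]


# Stand-in for "do X, Y or Z with it": upper-case each chunk.
results = run_loop(io.BytesIO(b"abcdefghij"), 4, lambda c: c.upper())
print(results)
```

In an agent setting, `step` would presumably be a model call and the stream would be whatever input or tool output is being fed in, but that mapping is an assumption on top of the comment, not something it states.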

It’s recursion and memoization to avoid fractalness all the way down. We keep trying to make these language bubbles that mean something but they mean nothing to the grand churn of the universe. The effort to so strictly and specifically codify a generalized, endless, mechanics of reality is a wacky hallucination humans keep diving into

mountainriver 865 days ago

Where are you seeing that Claude 3 scored 85%? That would be a massive jump

PodgieTar 865 days ago

https://paperswithcode.com/sota/code-generation-on-humaneval

riku_iki 865 days ago

HumanEval is very different from SWE-bench, on which Devin is tested.

PodgieTar 864 days ago

I didn't say it was the same, I compared non-agentic Claude to this. This used HumanEval.

riku_iki 862 days ago

You said:

> how this performs against the same benchmark Devin was using

> ...

> Claude 3 Opus already scored around 85-86% on these benchmarks

Devin used SWE-bench, not HumanEval, which kinda implies you said Opus got 85% on SWE-bench which is not true. This was my confusion..

Bjorkbat 865 days ago

Reminds me of this paper where some researchers had AIs role play as employees at a startup and tasked them with building various forms of software. It was pretty interesting. Managed to build Pong.

Thing is though, they neglected to compare this against a control, and the examples they tested this on were examples that GPT had no problem building. No idea if this actually improved performance in LLMs.

I think comments like these are worthwhile because, frankly, I can’t trust AI researchers to run good experiments or evaluate their models properly for a variety of reasons. I mean, most scientific papers in general are hard to replicate and have flaws concerning sample size and what have you. (Related: I still remember my disillusionment at finding out that the average Hacker News commenter was an idiot incapable of critical thinking when the LK-99 hype reached a fever pitch.) In any other context we would be deeply suspicious of the results if they were sponsored by a corporate party, yet in the context of AI we don’t seem to care that most AI researchers work for Microsoft.