| HN Mirror


	by TechDebtDevin 754 days ago
	Damn, show us your brilliant prompt then. LLMs cannot do this, not even in Python, for which there are libraries like Blacksheep that honestly make it a trivial task.

bongodongobob 754 days ago

My point is that you shouldn't expect to one-shot everything. Have it start by writing a spec, then outline classes and methods, then write the code, and then feed it the debug output.
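
A rough sketch of that staged workflow, for anyone who wants to try it. Everything here is an assumption for illustration: `ask_model` is a hypothetical stand-in for whatever LLM call you use, and the stage wording is made up, not a recommended prompt.

```python
from typing import Callable, List

# Hypothetical stage instructions mirroring the steps described above.
STAGES = [
    "Write a short spec for the task.",
    "Outline the classes and methods the spec needs.",
    "Write the code for the outline.",
]


def staged_prompts(task: str, ask_model: Callable[[str], str]) -> List[str]:
    """Run each stage in order, feeding all earlier outputs back in as context."""
    outputs: List[str] = []
    for instruction in STAGES:
        # Each prompt is: the task, everything produced so far, then the next step.
        parts = [f"Task: {task}", *outputs, instruction]
        outputs.append(ask_model("\n\n".join(parts)))
    return outputs


def debug_round(code: str, debug_output: str, ask_model: Callable[[str], str]) -> str:
    """One 'feed it the debug output' round: send the code plus the error text back."""
    prompt = f"This code fails:\n{code}\n\nDebug output:\n{debug_output}\n\nFix it."
    return ask_model(prompt)
```

The last element of the returned list is the generated code; `debug_round` can then be called repeatedly with whatever traceback or test failure you get.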

TechDebtDevin 754 days ago

I see your point, but hand-holding isn't really a good way to benchmark a model's coding capabilities.

Closi 754 days ago

Depends if benchmarking is the aim, rather than decreasing the time it takes to build things.

TechDebtDevin 754 days ago

Well sure, but that wasn't what we were discussing. The original comment says they use that as their benchmark. While their coding task is a bit complex compared to other benchmarking prompts, it's not that crazy. Here is an example of prompts used for benchmarking with Python for reference:

https://huggingface.co/datasets/mbpp?row=98

At the end of the day LLMs in their current iteration aren't intended to do even moderately difficult tasks on their own but it's fun to query them to see progress when new claims are made.
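
For context, benchmark prompts of this kind pair a one-sentence task with a few assert-based tests. Here is a minimal sketch of how such a check can work. The task below is made up for illustration and is not taken from the dataset, and the `text`/`test_list` field names are an assumption about its layout.

```python
from typing import Dict, List

# A made-up task in the style of a short Python benchmark prompt.
TASK = {
    "text": "Write a function sum_of_squares(xs) that returns the sum of the squares of a list.",
    "test_list": [
        "assert sum_of_squares([1, 2, 3]) == 14",
        "assert sum_of_squares([]) == 0",
    ],
}


def passes(candidate_source: str, tests: List[str]) -> bool:
    """Return True if the candidate code defines a solution that passes every test."""
    namespace: Dict[str, object] = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function
        for test in tests:
            exec(test, namespace)  # each test is a bare assert statement
    except Exception:
        # Syntax errors, runtime errors, and failed asserts all count as a fail.
        return False
    return True
```

A model's single response either passes or fails as a whole, which is why this style of scoring says little about how well it does with follow-up prompts. Only run `exec` on model output inside a sandbox.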

Closi 754 days ago

The original comment says nothing about benchmarking; they just say that an AI can’t one-shot their complex task?

amne 754 days ago

When I read

"My favorite thing to ask the models designed for programming is ....... None of them ever get it right"

I read "benchmark".

bottom999mottob 754 days ago

Exactly, expecting 100% working code from a single prompt is ridiculous at this point. It's why tools like Aider are so useful, because you can iteratively diff generated code until it's usable.

TechDebtDevin 754 days ago

Sure, it's impossible at this point, but the point of a benchmark isn't to complete the task; it's to test its efficacy overall and to see progress. None of them are at 100% on even the simplistic Python benchmarks, but that doesn't mean we shouldn't measure that capability. But sure, I get it. That's not how they are intended to be used, but that's also not the point the commenter was laying out.

ben_w 754 days ago

Prompts like yours (I ask them for a fluid dynamics simulator which also doesn't succeed) inform us of the level they have reached. A useful benchmark, given how many of the formal ones they breeze through.

I'm glad they can't quite manage this yet. Means I still have a job.

Closi 754 days ago

Break your prompt up into smaller pieces and it can.

qeternity 754 days ago

Taken to the extreme, a sufficiently broken down prompt is simply the code itself.

The whole point is to prompt less?

meiraleal 754 days ago

> Taken to the extreme, a sufficiently broken down prompt is simply the code itself

It is not. But the artifacts generated through the steps will be code. The last prompt will have most of the code supplied to it as context.

buddhistdude 754 days ago

No, he is right; he is saying "taken to the extreme". The point is that the more specific your prompts have to be, the more you are contributing to the result yourself and the less the model is.

meiraleal 754 days ago

Yes, but the buildup isn't manual. You keep patching prompts with responses until the final result. The last prompt will contain almost the complete code, obviously.

qeternity 751 days ago

Again, you are missing the "taken to the extreme".

What has happened to HN discourse recently?

achierius 754 days ago

A prompt is just a specification for an output. Code is just what we call a sufficiently detailed specification.

Closi 754 days ago

More practically, the whole point is to prompt enough to generate valid code.

bongodongobob 754 days ago

Well, now we get into information density and Kolmogorov complexity. The more complicated your desired output program is, the more information you'll have to put in, i.e., more complicated prompts.
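
A quick way to get a feel for this: Kolmogorov complexity itself is uncomputable, but compressed size is a common rough stand-in for it. The sketch below uses `zlib` as that stand-in (my choice of proxy, not something from the thread), and the two sample "programs" are made up for illustration.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes after zlib compression, a crude proxy for information content."""
    return len(zlib.compress(text.encode("utf-8"), 9))


# A highly repetitive program: one line repeated 50 times.
repetitive = "print('hello')\n" * 50

# A program of similar length in which every line carries different details.
varied = "\n".join(
    f"def f{i}(x): return x * {i} + {i * i % 7}" for i in range(50)
)

for name, source in [("repetitive", repetitive), ("varied", varied)]:
    print(name, len(source), "bytes raw,", compressed_size(source), "bytes compressed")
```

The repetitive program compresses to far fewer bytes than the varied one, which matches the intuition above: "print hello 50 times" is a short prompt, while the varied one needs its details spelled out somewhere.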
