by ddavis 754 days ago
My favorite thing to ask the models designed for programming is: "Using Python write a pure ASGI middleware that intercepts the request body, response headers, and response body, stores that information in a dict, and then JSON encodes it to be sent to an external program using a function called transmit." None of them ever get it right :)
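For reference, here is a minimal sketch of what the prompt seems to be asking for, written against the plain ASGI callable interface with no framework. The class name `CaptureMiddleware`, the record keys, and the body of `transmit` are assumptions made for illustration; the prompt only names the function. The sketch also assumes UTF-8 bodies, buffers everything in memory, and only sees the part of the request body that the wrapped app actually reads.

```python
import json


def transmit(payload: str) -> None:
    """Placeholder for the 'transmit' function named in the prompt (body is assumed)."""
    print(payload)


class CaptureMiddleware:
    """Pure ASGI middleware: wraps the receive/send callables, no framework needed."""

    def __init__(self, app, transmit=transmit):
        self.app = app
        self.transmit = transmit

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass lifespan and websocket traffic through untouched.
            await self.app(scope, receive, send)
            return

        request_body = bytearray()
        response_body = bytearray()
        record = {"response_headers": []}

        async def wrapped_receive():
            message = await receive()
            if message["type"] == "http.request":
                # The body may arrive in several chunks; accumulate them.
                request_body.extend(message.get("body", b""))
            return message

        async def wrapped_send(message):
            if message["type"] == "http.response.start":
                record["status"] = message["status"]
                # ASGI headers are (bytes, bytes) pairs; decode them for JSON.
                record["response_headers"] = [
                    [name.decode("latin-1"), value.decode("latin-1")]
                    for name, value in message.get("headers", [])
                ]
            elif message["type"] == "http.response.body":
                response_body.extend(message.get("body", b""))
            await send(message)

        try:
            await self.app(scope, wrapped_receive, wrapped_send)
        finally:
            # Transmit once per request, even if the app raised part-way through.
            record["request_body"] = request_body.decode("utf-8", "replace")
            record["response_body"] = response_body.decode("utf-8", "replace")
            self.transmit(json.dumps(record))
```

The parts that are easy to get wrong are the chunked request body (`more_body`), the bytes-typed headers, and forwarding every message unchanged so the wrapped app behaves the same with or without the middleware.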
8 comments

I normally ask about building a multi-tenant system using async SQLAlchemy 2 ORM where some tables are shared between tenants in a global PostgreSQL schema and some are in a per-tenant schema.

Nothing gets it right first time, but when ChatGPT 4 first came out, I could talk to it more and it would eventually get it right. Not long after that though, ChatGPT degraded. It would get it wrong on the first try, but with every subsequent follow up it would forget one of the constraints. Then when it was prompted to fix that one, it forgot a different one. And eventually it would cycle through all of the constraints, getting at least one wrong each time.

Since then benchmarks came out showing that ChatGPT “didn’t really degrade”, but all of the benchmarks seemed focused on single question/answer pairs and not actual multi-turn chat. For this kind of thing, ChatGPT 4 has never managed to recover to as good as it was when it was first released in my experience.

It’s been months since I’ve had to deal with that kind of code, so I might be forgetting something, but I just tried it with Codestral and it spat out something that looked reasonable very quickly on its first try.

>It would get it wrong on the first try, but with every subsequent follow up it would forget one of the constraints. Then when it was prompted to fix that one, it forgot a different one. And eventually it would cycle through all of the constraints, getting at least one wrong each time.

That drives me nuts and makes me ragequit about half the time. Although it's usually more effective to go back and correct your initial prompt rather than prompt it again.

I had a similar experience. I was trying to get GPT 4 to write some R/Stan code for a bit of Bayesian modelling. It would get the model wrong, and then I would walk it through how to do it right, and by the end it would almost get it right, but on the next step, it would be like, oh, this is what you want, and the output was identical to the first wrong attempt, which would start the loop over again.
Similar experience using GPT4 for help with Apple's Accessibility API. I wanted to do some non-happy-path things and it kept looping between solutions that failed to satisfy at least one of a handful of requirements that I had, and in ways that I couldn't combine the different "solutions" to meet all the requirements.

I was eventually able to figure it out with the help of some early 2010s blog posts. Sadly I didn't test giving it that context and having it attempt to find a solution again (and this was before web browsing was integrated with the web app).

More of an issue than it not knowing enough to fulfill my request (it was pretty obscure so I didn't necessarily expect that it would be able to) was that it didn't mind emitting solutions that failed to meet the requirements. "I don't know how to do that" would've been a much preferred answer.

This seems like an important failure mode to me. I too have noticed GPT-4 looping between a few different failure cases; in my case it was state transitions in JS code. Explaining to it what it did wrong didn't help.
I ask software developers to do the same thing and give them the same amount of time. None of them ever write a single line of code :)
Give an LLM all the time you want, and it will still not get it right. In fact, it will most likely give worse and worse answers with time. That’s a big difference from a software developer.
My experience is very different. Often it (ChatGPT or Copilot, depending on what I'm trying to accomplish) gets things right the first time. When it doesn't, it's usually close enough that a bit of manual modification is all that's needed. Sometimes it's totally wrong, but I can usually point it in the right direction.
I mean, with a nonzero temperature, the randomness will eventually produce every combination of tokens in the corpus, so with a sufficiently large "all the time you want" you can produce limitless correct answers
I love to ask it to "make me a Node.js library that pings an ipv4 address, but you must use ZERO dependencies, you must only use the native Node.js API modules"

The majority of models (both proprietary and open-weight) don't understand:

- by inference, ping means we're talking about ICMP

- ICMP requires raw sockets

- Node.js has no native raw socket API
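To make the ICMP point concrete, here is a rough Python sketch of what a ping has to put on the wire: an Echo Request (type 8, code 0) with an Internet checksum as described in RFC 1071. The function names are made up for illustration. Nothing is sent here; actually sending the packet needs a raw (or ICMP datagram) socket, which usually requires elevated privileges, and that socket is the piece the comment says Node.js core does not expose.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """RFC 1071 checksum: one's-complement sum of 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def build_echo_request(identifier: int, sequence: int, payload: bytes = b"ping") -> bytes:
    """Build an ICMP Echo Request (type 8, code 0) with a valid checksum."""
    # The checksum is computed over the message with its checksum field zeroed.
    header = struct.pack("!BBHHH", 8, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, checksum, identifier, sequence) + payload
```

Building the bytes is the easy half and works in any language with byte buffers. The hard constraint is the socket type, which is why zero-dependency answers tend to fall back on something other than ICMP, such as spawning the system `ping` binary or timing a TCP connect.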

You can do some CoT trickery to help it reason about the problem and maybe finally get it settled on a variety of solutions (usually some flavor of building a native add-on using C/C++/Rust/Go), or just guide it there step by step yourself, but the back and forth to get there requires a ton of pre-knowledge of the problem space which sorta defeats the purpose. If you just feed it the errors you get verbatim trying to run the code it generates, you end up in painful feedback loops.

(Note: I never expect the models to get this right, it's just a good microcosmic but concrete example of where knowledge & reasoning meet actual programming acumen, so it's cool to see how models evolve to get better, if at all, at the task).

This is the same level of gotcha that everyone complains about when interviewing. It mainly depends on the interviewee having the same assumptions (pings definitely do not have to be ICMP) and the same, usually bespoke, knowledge base (Node.js peculiarities). I can see that an LLM should know whether raw sockets are available, but that's not what you asked.

In fact you deliberately asked for something impossible and hold up undefined behavior as undefined like it's impugning something.

> In fact you deliberately asked for something impossible and hold up undefined behavior as undefined like it's impugning something.

Correct, I did. This is a direct indictment of a given model's ability to plan/reason in this particular context. There are plenty of situations where models will respond with "Sorry, that's not possible". Ask GPT-4 "Tell me how to grow biological wings on a human" and it will respond with something along the lines of "this isn't currently possible, but here's a theoretical exploration of the idea".

GPT-4 gets very close on its own to the Node.js question, with a response breakdown similar to the one above, provided the prompt is clear and detailed enough. But I test the open-weight models in the same way to see if they have the capacity to exhibit similar reasoning or chain-of-thought process on their own. They usually don't without excessive prompt engineering or few-shot examples.

I said that I don't expect models to get this right not because I don't _want_ them to, it's because I think it's an important milestone when they do. Autoregressive token prediction is unlikely to produce the real outcome I'm testing for here, but if it ever does that's an interesting finding.

I usually throw some complex Rust code with lifetime requirements at them and ask them to fix it. LLMs aren't capable of providing much help with that in general, other than in some very basic cases.

The best way to get your work done is still to look into Rust forums.

It works amazingly well for the ones that never coded in Rust, at least in my experience. It took me a couple hours and 120 lines of code to set up a WebRTC signaling server.
Cool, you've identified that your prompt is inadequate for the task.

'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?'

Damn, show us your brilliant prompt then. LLMs cannot do this, not even in Python, for which there are libraries like BlackSheep that honestly make it a trivial task.
My point is that you shouldn't expect to one shot everything. Have it start by writing a spec, then outline classes and methods, then write the code, and feed it debug stuff.
I see your point, but hand-holding isn't really a good way to benchmark a model's coding capabilities.
Depends if benchmarking is the aim, rather than decreasing the time it takes to build things.
Well sure, but that wasn't what we were discussing. The original comment says they use that as their benchmark. While their coding task is a bit complex compared to other benchmarking prompts, it's not that crazy. Here is an example of prompts used for benchmarking with Python for reference:

https://huggingface.co/datasets/mbpp?row=98

At the end of the day LLMs in their current iteration aren't intended to do even moderately difficult tasks on their own but it's fun to query them to see progress when new claims are made.

Exactly, expecting one shot 100% working code with one prompt is ridiculous at this point. It's why libraries like Aider are so useful, because you can iteratively diff generated code until it's useable.
Sure, it's impossible at this point, but the point of a benchmark isn't to complete the task, it's to test its efficacy overall and to see progress. None of them are at 100% on even the simplistic Python benchmarks; that doesn't mean we shouldn't measure that capability. But sure, I get it. That's not how they are intended to be used, but that's also not the point the commenter was laying out.
Prompts like yours (I ask them for a fluid dynamics simulator which also doesn't succeed) inform us of the level they have reached. A useful benchmark, given how many of the formal ones they breeze through.

I'm glad they can't quite manage this yet. Means I still have a job.

Break your prompt up into smaller pieces and it can.
Taken to the extreme, a sufficiently broken down prompt is simply the code itself.

The whole point is to prompt less?

> Taken to the extreme, a sufficiently broken down prompt is simply the code itself

it is not. But the artifacts generated through the steps will be code. The last prompt will have most of the code supplied to it as the context.

No, he is right; he is saying "taken to the extreme". The point is that the more specific your prompts have to be, the more you are actually contributing to the result yourself and the less the model is.
A prompt is just a specification for an output. Code is just what we call a sufficiently detailed specification.
More practically, the whole point is to prompt enough to generate valid code.
Well now we get into information density and Kolmogorov complexity. The more complicated your desired output program is, the more information you'll have to put in, i.e., more complicated prompts.
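A rough way to see the information-density argument: Kolmogorov complexity itself is uncomputable, but compressed size gives a crude upper bound on how short a description of some output can be. The sketch below is only that proxy, using zlib and two made-up 10,000-byte inputs; it says nothing precise about prompts or models.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed form: a crude upper bound on description length."""
    return len(zlib.compress(data, 9))


# Highly regular output: a short rule ("repeat 'ab' 5000 times") describes it.
regular = b"ab" * 5000

# Irregular output: pseudo-random bytes (fixed seed so the run is repeatable),
# which a general-purpose compressor cannot shorten.
rng = random.Random(0)
irregular = bytes(rng.randrange(256) for _ in range(10000))

print(len(regular), compressed_size(regular))
print(len(irregular), compressed_size(irregular))
```

Both inputs are the same length, but the regular one compresses to a small fraction of its size while the irregular one does not shrink. By analogy, an output with little exploitable regularity needs roughly as much information going in as comes out.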
How is that "putting in wrong figures"? It's a perfectly valid prompt, written in clear, proper English.
It's something I know how to do after figuring it out myself and discovering the potential sharp edges, so I've made it into a fun game to test the models. I'd argue that it's a great prompt (to keep using consistently over time) to see the evolution of this wildly accelerating field.
Do you notice any progress over time?
Interesting. My favorite thing to ask the models is to refactor code I've not touched for too long and this works very well.
Can you get it right without an IDE?
Nope, I don't know how to do it at all; that's why I have to ask AI!
gpt-4o gets it right on the first try for me. Just ran it and tested it.