| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by franze 4 days ago

what am i missing?

you take a spec and create tests, every little thing

you use another ai to verify these tests against the spec

you review the tests vs the spec (at one point human review)

you put the tests off limits to change / wall them.

you let the ai write the software that fulfills the tests.

there will be some gaps where you repeat the cycle above

if the tests fulfill the spec, the code will fulfill the spec

6 comments

torben-friis 4 days ago

>you take a spec and create tests, every little thing

A spec detailed enough and unambiguous enough to be translated into machine execution deterministically is called code.

Unlike a compiler, AI can build with a spec that is not detailed enough or unambiguous enough: It does so by filling in the gaps with educated guesses.

This is safe if and only if you take the time to later read the output, understand what its guesses were, and judge wether they were acceptable. No AI can do this for you because the truth lies in your original intentions, which it does not have access to.

The jury is out there on how reliable and time consuming this is vs writing the code yourself; it is not immediately obvious that is faster or requires a smaller cognitive load.

link

hparadiz 4 days ago

Code is not a spec. It's an instruction set. It can be a spec if you try hard but that's not an inherent property of code. For example you can write code to be a compiler..that makes it a spec. But hello world is not a spec.

As for whether or not LLMs can write unit tests. The answer is yes.

link

recursive 4 days ago

Hello world is a spec. The spec says to produce the text hello world on standard output.

link

hparadiz 4 days ago

Try running it without a compatible ABI. See how far you get.

link

recursive 4 days ago

Not sure what the point is. We can update the spec with "in the presence of a compatible ABI".

link

hparadiz 4 days ago

All I'm saying is a program isn't VHS. It's a VHS tape. At that point it's largely philosophy. Can you reconstruct a VHS format from a VHS tape? Sure.

link

steveBK123 4 days ago

If each step requires micro-steps iterating with an LLM with human review to prevent hallucinations creeping in.. at some point you might just be better off letting the human do the work.

Particularly as tokenmaxxing has ended and people are being charged more economic prices. If the pricing 5-10x the way Uber,etc did on the path to profitability.. even more so.

link

Daishiman 4 days ago

> If each step requires micro-steps iterating with an LLM with human review to prevent hallucinations creeping in.. at some point you might just be better off letting the human do the work.

I mean for a lot of spec code people define the API signatures and let the llm run with it which is an excellent tradeoff.

link

officialchicken 4 days ago

IME, regulatory compliance is something you are rarely able to test for in a nice little box or with well-known suite. So there's no easy "this complies" in many situations, no matter how many lawyers, compliance officers, and llm's you run it past.

link

franze 4 days ago

so, whats the difference to human engineering?

other than there are "internal micro feedback loops" during development?

link

hedora 4 days ago

I walked down that path for a few months. The more you constrain LLM's, the more underhanded they behave in order to produce something that satisfies all the constraints.

Doing the above doesn't actually make the model smarter, so, if it couldn't get to correct code with fewer steps, then the light you see at the end of the tunnel is an oncoming train.

link

sigbottle 4 days ago

This is such an abstract principle that the principle itself cannot be refuted. The plan sounds fine on paper. "Just iterate bro". But it entirely depends on what rational agents you put into the system. Obviously, if I sub in a 5 year old child everywhere, this loop breaks. Humans and AI, sometimes one is better than the other at certain things, we're still learning.

The only way to test this is to test it out, in real life. Sometimes people see results, sometimes people don't. Note that yes, I am including the entire iteration process - even after iterating, people still don't see results with AI.

I have had both positive and negative experiences with AI, over multi-week projects. But apparently on hackernews, anything positive about AI is proof that AI is superhuman and taking over, and all follies about AI are lies by stupid humans who secretly have psychological dispositions to fear AI. Sometimes the AI genuinely isn't good enough. Are we not allowed to say that now? We might not know why, but it's just the truth.

The other solution is to formally analyze the entire space of possible actions the agent can take a priori. Then yes, you can definitively say whether or not the principle breaks or not. Can you, though? Can you give a formal specification for the space of possible actions for AI and show that your loop never breaks, or breaks less than humans, or any other sensible criteria? If not, then you can't just give an abstract principle and start making inferences from that.

link

bobkb 4 days ago

It’s impossible to write a spec that’s not ambiguous , complete and correct in natural languages. Thus prompts will always generate unreliable software.

link