
by roadbuster 110 days ago

> The Claude C Compiler illustrates the other side: it optimizes for passing tests, not for correctness. It hard-codes values to satisfy the test suite. It will not generalize.

This is one of the pain points I am suffering at work: workers ask coding agents to generate some code, and then to generate test coverage for the code. The LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code. At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").

The icing on the cake is that LLMs are producing so much code that humans are just rubber stamping all of it. Off to merge and build it goes.

I have no constructive recommendations; I feel the industry will keep their foot on the pedal until something catastrophic happens.

21 comments

WhyNotHugo 110 days ago

This is why you write the tests first and then the code. Especially when fixing bugs, since you can be sure that the test properly fails when the bug is present.
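
A minimal sketch of that bug-fix workflow, assuming a made-up median bug (both functions and the test are hypothetical, not taken from anyone's codebase). The only point is that the regression test is run against the buggy version first, so you see it fail before you trust it:

```python
def buggy_median(values):
    """Original implementation: forgets to average the two middle values."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def fixed_median(values):
    """Fixed implementation: handles even-length input."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def regression_test(median):
    """Written before the fix; returns True when the behaviour is correct."""
    return median([1, 2, 3, 4]) == 2.5 and median([3, 1, 2]) == 2


# The test must fail against the buggy code and pass against the fix.
print(regression_test(buggy_median))  # False
print(regression_test(fixed_median))  # True
```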

pmontra 110 days ago

When fixing bugs, yes. When designing an app, not so much, because you realize many unexpected things while writing the code and seeing how it behaves. Often the original test code would test something that is never built. It's obvious for integration tests but it happens for tests of API calls and even for unit tests. One could start writing unit tests for a module or class and eventually realize that it must be implemented in a totally different way. I prefer experimenting with the implementation and writing tests only when it settles down into something that I'm confident will go to production.

Karrot_Kream 110 days ago

Where I'm at currently (which may change) is that I lay down the foundation of the program and its initial tests first. That initial bit is completely manual. Then when I'm happy that the program is sufficiently "built up", I let the LLM go crazy. I still audit the tests, though personally auditing tests is the part of programming I like the very least. This also largely preserves the initial architectural patterns that I set, so it's just much easier to read LLM code.

In a team setting I try to do the same thing and invite team members to start writing the initial code by hand only. I suspect if an urgent deliverable comes up though, I will be flexible on some of my ideas.

koonsolo 110 days ago

> When fixing bugs, yes.

One thing I want to mention here is that you should try to write a test that not only prevents this bug, but also similar bugs.

In our own codebase we saw that regressions on fixed bugs are very rare. So writing a specific test for each one isn't the best way to spend your resources. Writing a broad test, when possible, is.

Not sure how LLMs handle that case and come up with a proper test.
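
One way to read "a broad test" is to check invariants over a whole family of inputs rather than pinning the single input from the bug report. A sketch, assuming a hypothetical `paginate` helper that once had an off-by-one bug (the helper and both tests are invented for illustration):

```python
def paginate(items, page_size):
    """Split items into consecutive pages of at most page_size elements."""
    return [items[i:i + page_size] for i in range(0, len(items), page_size)]


def narrow_test():
    # Pins only the one input from the (hypothetical) bug report.
    return paginate(list(range(10)), 5) == [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]


def broad_test():
    # Checks invariants for every small length/page-size combination,
    # so neighbouring off-by-one bugs are caught as well.
    for n in range(0, 30):
        for size in range(1, 12):
            items = list(range(n))
            pages = paginate(items, size)
            flattened = [x for page in pages for x in page]
            if flattened != items:
                return False  # an element was lost, duplicated or reordered
            if any(len(page) == 0 or len(page) > size for page in pages):
                return False  # a page is empty or too large
    return True
```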

pipecmd 110 days ago

I'd argue the AI writing the tests shouldn't even know about the implementation at all. You only want to pass it the interface (or function signatures) together with javadocs/docstrings/equivalent.
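
A rough sketch of what "only the interface" could look like: the test below is derived purely from a `Protocol` and its docstrings, and an implementation is plugged in afterwards. `Stack`, `contract_test` and `ListStack` are all invented names for illustration.

```python
from typing import Callable, Protocol


class Stack(Protocol):
    def push(self, item: int) -> None:
        """Put item on top of the stack."""

    def pop(self) -> int:
        """Remove and return the top item; raise IndexError when empty."""

    def __len__(self) -> int:
        """Return the number of stored items."""


def contract_test(make_stack: Callable[[], Stack]) -> bool:
    """Derived from the docstrings above, not from any implementation."""
    s = make_stack()
    if len(s) != 0:
        return False
    s.push(1)
    s.push(2)
    if len(s) != 2 or s.pop() != 2 or s.pop() != 1:
        return False
    try:
        s.pop()
    except IndexError:
        return True
    return False


class ListStack:
    """One possible implementation, written separately from the test."""

    def __init__(self) -> None:
        self._items = []

    def push(self, item: int) -> None:
        self._items.append(item)

    def pop(self) -> int:
        return self._items.pop()  # list.pop raises IndexError when empty

    def __len__(self) -> int:
        return len(self._items)
```

Running `contract_test(ListStack)` then checks the implementation against the documented contract only.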

GuB-42 110 days ago

I don't think it addresses the problem.

Writing the tests first and then writing code to pass the tests is no better than writing the code first then writing tests that pass. What matters is that both the code and the tests are written independently, from specs, not from one another.

I think that it is better not to have access to tests when first writing code, so as to make sure to code the specs and not code the tests that test the specs, as something may be lost in translation. It means that I have a preference for code first, but the ideal case would be for different people to do it in parallel.

Anyway, about AI: if an AI writes both the tests and the code, it will make sure they match no matter which comes first. It may even go back and forth between the tests and the code, but that doesn't mean the result is correct.

9rx 110 days ago

Tests are your spec. You write them first because that is the stage when you are still figuring out what you need to write.

That said, TDD says you should write only one test before implementing it, which encourages spec writing to be an iterative process.

Writing the spec after implementation means that you are likely to have forgotten the nuance that went into what you created. That is why specs are written first. Then the nuance is captured up front as it comes to mind.

GuB-42 110 days ago

Tests are not any more or any less of a spec than the code. If you are implementing an HTTP server for instance, RFC 7231 are your specs, not your tests, not your code.

I would say that which comes first between specs and code depends on the context. If you are implementing a standard, the specs of the standard obviously come first, but if you are iterating, maybe for a user interface, it can make sense to start with the code so that you can have working prototypes. You can then write formal documents and tests later, when you are done prototyping, for regression control.

But I think that leaning on tests is not always a good idea. For example, let's continue with the HTTP server. You write a test suite, but there is a bug in your tests, I don't know, you confuse error 404 and 403. Then you write your code, correctly, run the tests, and see that one of your tests fails and tells you that you returned 404 and not 403. You don't think much, after all "the tests are the specs", and change the code. Congratulations, you are now making sure your code is wrong.

Of course, the opposite can and does happen: writing the code wrong and making the tests pass without thinking about what you are actually testing. I believe that's why people came up with the idea of TDD, but for me, test-first flips the problem without solving it. I'd say the only advantage, if it is one, is that it prevents taking a shortcut and releasing untested code by moving tests out of the critical path.

But outside of that, I'd rather focus on the code, so if something is to be "the spec", that's it. It is the most important, because it is the actual product, everything else is secondary. I don't mean unimportant, I mean that from the point of view of users, it is better for the test suite to be broken than for the code to be broken.

9rx 110 days ago

> RFC 7231 are your specs

It is more like a meta spec. You still have to write a final spec that applies to your particular technical constraints, business needs, etc. RFC 7231 specifies the minimum amount necessary to interface with the world, but an actual program to be deployed into the wild requires much, much more consideration.

And for that, since you have the full picture not available to a meta spec, logically you will write it in a language that both humans and computers can understand. For the best results, that means something like Lean, Rocq, etc. However, in the real world you likely have to deal with middling developers straight out of learn to code bootcamps, so tests are the practical middle ground.

> I don't know, you confuse error 404 and 403.

Just like you would when writing RFC 7231? But that's what the RFC process is for. You don't have to skip the RFC process just because the spec also happens to be machine readable. If you are trying to shortcut the process, then you're going to have this problem no matter what.

But, even when shortcutting the process, it is still worthwhile to have written your spec in a machine-readable format as that means any changes to the spec automatically identify all the places you need to change in implementation.

> writing the code wrong and making the tests pass without thinking about what you are actually testing

The much more likely scenario is that the code is right, but a mistake in the test leads it to not test anything. Then, years down the road after everyone has forgotten or moved on, when someone needs to do some refactoring there is no specification to define what the original code was actually supposed to do. Writing the test first means that you have proven that it can fail. That's not the only reason TDD suggests writing a test first, but it is certainly one of them.

> It is the most important, because it is the actual product

Nah. The specification is the actual product; it is what lives for the lifetime of the product. It defines the contract with the user. Implementation is throwaway. You can change the implementation code all day long and as long as the user contract remains satisfied the visible product will remain exactly the same.

GuB-42 110 days ago

> The much more likely scenario is that the code is right, but a mistake in the test leads it to not test anything.

What I usually do to prevent this situation is to write a passing test, then modify the code to make it fail, then revert the change. It also gives an occasion to read the code again, kind of like a review.
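
That pass, break, revert routine is roughly what mutation testing tools automate. A hand-rolled sketch with a hypothetical `is_adult` check (every name here is made up): the deliberately broken copy stands in for "modify the code to make it fail", and a test only earns trust if it notices the break.

```python
def is_adult(age):
    return age >= 18


def is_adult_mutated(age):
    # Deliberate break: ">=" changed to ">".
    return age > 18


def weak_test(fn):
    # Never probes the boundary, so it cannot tell the two versions apart.
    return fn(30) is True and fn(5) is False


def strong_test(fn):
    # Probes the boundary at 18, which is where the mutation lives.
    return fn(18) is True and fn(17) is False and weak_test(fn)


print(weak_test(is_adult), weak_test(is_adult_mutated))      # True True
print(strong_test(is_adult), strong_test(is_adult_mutated))  # True False
```

The weak test passes against both versions, which is the "test that does not test anything" case in miniature.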

I have never seen this practice formalized though, which is good for me: this is the kind of thing I do because I care, and turning it into a process with Jira and such is a good way to make me stop caring.

mycall 110 days ago

Also, if you find after implementation that the spec wasn't specific enough, go ahead and refresh the spec and have the LLM redo the code, from scratch if necessary. Writing code is so cheap right now, it takes a different mindset in general.

usefulcat 110 days ago

Agreed 1000%. But that can be a lot of work; creating a good set of tests is nearly as much or often even more effort than implementing the thing being tested.

When LLMs can assist with writing useful tests before having seen any implementation, then I’ll be properly impressed.

byzantinegene 110 days ago

From experience, AI is bad at TDD. LLMs can infer tests based on written code, but they are bad at writing generalised tests unless a clear requirement is given, so you, the engineer, are doing most of the work anyway.

9rx 110 days ago

My day job has me working on code that is split between two different programming languages. I'd say LLMs are pretty good at TDD in one of those languages and a hot mess in the other.

Which, funny enough, is a pretty good reflection of how I thought of the people writing in those languages before LLMs: One considers testing a complete afterthought and in the wild it is rare to find tests at all, and when they are present they often aren't good. Whereas the other brings testing as a first-class feature and most codebases I've seen generally contain fairly decent tests.

No doubt LLM training has picked up on that.

dustingetz 110 days ago

try this for a UI

porphyra 110 days ago

At my job we have a requirement for 100% test coverage. So everyone just uses AI to generate 10,000 line files of unit tests and nobody can verify anything.
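
For illustration, a sketch of how a test can reach full line coverage while verifying nothing. `shipping_cost`, its pricing rule and the broken stand-in are assumptions invented for the example:

```python
def shipping_cost(weight_kg, express):
    if weight_kg <= 0:
        raise ValueError("weight must be positive")
    cost = 5.0 + 1.5 * weight_kg
    if express:
        cost *= 2
    return cost


def wrong_shipping_cost(weight_kg, express):
    # Broken stand-in: same control flow, nonsense result.
    if weight_kg <= 0:
        raise ValueError("weight must be positive")
    return 0.0


def coverage_only_test(fn):
    # Executes every line and branch, asserts nothing about the results.
    fn(2, express=True)
    fn(2, express=False)
    try:
        fn(0, express=False)
    except ValueError:
        pass
    return True


def behavioural_test(fn):
    # Checks the values the (assumed) pricing rule should produce.
    return fn(2, express=False) == 8.0 and fn(2, express=True) == 16.0
```

The coverage-only test is green for both functions; only the behavioural test rejects the broken one.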

harimau777 110 days ago

Exactly! It's frustrating how much developers get blamed for the outcomes of incompetent management.

philipallstar 110 days ago

> everyone just uses AI to generate 10,000 line files of unit tests and nobody can verify anything

This is not a guaranteed outcome of requiring 100% coverage. Not that that's a good requirement, but responding badly to a bad requirement is just as bad.

IAmGraydon 110 days ago

Yeah this is the exact kind of ridiculousness I've noticed as well - everything that comes out of an LLM is optimized to give you what you want to hear, not what's correct.

littlestymaar 110 days ago

A long time ago in France, the mainstream view among computer people was that code and compute weren't what matters when dealing with computers; it is information that matters, and how you process it in a sensible way (hence the French name for computer science: informatique. And also the name for computer: “ordinateur”, literally: what sets things into order).

As a result, computer students were taught a lot (too much for most people's taste, it seems) about data modeling and not too much about code itself, which was viewed as mundane and uninteresting until the US hacker culture finally took over in the late 2000s.

Turns out that the French were just right too early, like with the Minitel.

msh 110 days ago

"Computer science is no more about computers than astronomy is about telescopes." -Dijkstra

scotty79 110 days ago

> The LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code.

I always felt like that's the main issue with unit testing. That's why I used it very rarely.

Maybe keeping tests in a separate module, not letting the agent see the source while writing tests, and not letting the agent see the tests while writing the implementation would help? They could just share the API and the spec.

And in case of tests failing, another agent with full context could decide if the fix should be delegated to the coding agent or to the testing agent.

Herring 110 days ago

> At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").

Obvious question: why not? Let’s say you have competent devs, fair assumption. Maybe it’s because they don’t have enough time for solid QA? Lots of places are feature factories. In my personal projects I have more lines of code doing testing than implementation.

sarchertech 110 days ago

It’s because people will do what they’re incentivized to do. And if no one cares about anything but whether the next feature goes out the door, that’s what programmers will focus on.

Honestly I think the other thing that is happening is that a lot of people who know better are keeping their mouths shut and waiting for things to blow up.

We’re at the very peak of the hype cycle right now, so it’s very hard to push back and tell people that maybe they should slow down and make sure they understand what the system is actually doing and what it should be doing.

shigawire 110 days ago

Or if you say we should slow down, your competence is questioned by others who are going very fast (and likely making mistakes we won't find until later).

And there is an element of uncertainty. Am I just bad at using these new tools? To some degree probably, but does that mean I'm totally wrong and we should be going this fast?

catlifeonmars 110 days ago

There is a saying: slow is smooth and smooth is fast.

I have personally outpaced some of my more impatient colleagues by spending extra time up front setting up test harnesses, reading specifications, etcetera. When done judiciously it pays off in time scales of weeks or less.

citizenpaul 110 days ago

Oh yeah, let them dig a hole and charge sweet consultant rates to fix it. Then the healing can begin.

harimau777 110 days ago

Developers aren't given time to test and aren't rewarded if they do, but management will rain down hellfire upon their heads if they don't churn out code quickly enough.

ojo-rojo 110 days ago

How about a subsequent review where a separate agent analyzes the original issue and resultant code and approves it if the code meets the intent of the issue? The principle being to keep an eye out for manual work that you can describe well enough to offload.

Depending on your success rate with agents, you can have one that validates multiple criteria or separate agents for different review criteria.

g947o 110 days ago

You are fighting nondeterministic behavior with more nondeterministic behavior, or in other words, fighting probability with probability. That doesn't necessarily make things any better.

pyridines 110 days ago

In my experience, an agent with "fresh eyes", i.e., without the context of being told what to write and writing it, does have a different perspective and is able to be more critical. Chatbots tend to take the entire previous conversational history as a sort of canonical truth, so removing it seems to get rid of any bias the agent has towards the decisions that were made while writing the code.

I know I'm psychologizing the agent. I can't explain it in a different way.

citizenpaul 110 days ago

I think of it as them being additively biased, i.e. "don't think about the pink elephant". Not only does this not help LLMs avoid pink elephants, it guarantees that pink elephant information is now being considered in inference when it was not before.

I fear thinking about problem solving in this manner to make llms work is damaging to critical thinking skills.

Foobar8568 110 days ago

Fresh eyes, some contexts and another LLM.

The problem is information fatigue from all the agents+code itself.

hex4def6 110 days ago

Aren't human coders also nondeterministic?

Assigning different agents to have different focuses has worked for me. Especially when you task a code reviewer agent with the goal of critically examining the code. The results will normally be much better than asking the coder agent who will assure you it's "fully tested and production ready"

samrus 110 days ago

Human coders are far more reliable. The only downside is speed, and therefore cost

tbossanova 110 days ago

Probably true

(Sorry.)

samrus 110 days ago

Slop on slop. Who watches the watchmen?

cmrdporcupine 110 days ago

My only hope is that all of this push leads in the end to the adoption of more formal verification languages and tools.

If people are having to specify things in TLA+ etc -- even with the help of an LLM to write that spec -- they will then have something they can point the LLM at in order for it to verify its output and assumptions.

8note 110 days ago

> At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").

It's fun having LLMs because it makes it quite clear that a lot of testing has been cargo-culting. Did people ever regularly check that the tests check for anything meaningful?

Foobar8568 110 days ago

15 years ago, I had a tester writing "UI tests" / "User tests" that matched what the software was cranking out. At that time I had just joined to continue on the client side, so I hadn't really worked on anything yet.

I had a fun discussion when the client tried to change values... Why is it still 0? Didn't you test?

And that was when I had to dive into the code base and cry.

mattacular 110 days ago

Test automation is kind of like a religion. It is comforting to believe that the solution to code is more code.

taatparya 110 days ago

Property testing could've helped
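
A sketch of what a property test might look like here, using only the standard library's `random` rather than a dedicated property-testing library. `order_total` and `stuck_at_zero` are hypothetical; the properties state how the output must move when the input changes, which a test that merely mirrors current output cannot express.

```python
import random


def order_total(quantities, unit_prices):
    """Hypothetical function under test: sum of quantity * unit price."""
    return sum(q * p for q, p in zip(quantities, unit_prices))


def stuck_at_zero(quantities, unit_prices):
    """Broken stand-in that ignores its input and always reports 0."""
    return 0


def holds_properties(total_fn, trials=200, seed=0):
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    for _ in range(trials):
        n = rng.randint(1, 8)
        qty = [rng.randint(0, 20) for _ in range(n)]
        price = [rng.randint(1, 10_000) for _ in range(n)]
        # Property 1: an order with all quantities zero totals zero.
        if total_fn([0] * n, price) != 0:
            return False
        # Property 2: one more unit of the first item adds exactly its price.
        bumped = [qty[0] + 1] + qty[1:]
        if total_fn(bumped, price) != total_fn(qty, price) + price[0]:
            return False
    return True


print(holds_properties(order_total))    # True
print(holds_properties(stuck_at_zero))  # False
```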

yanis_t 110 days ago

How long till the industry discovers TDD?

HWR_14 110 days ago

> The icing on the cake is that LLMs are producing so much code that humans are just rubber stamping all of it.

I don't understand the value of that much code. What features are worth that much more than stability?

ZaoLahma 110 days ago

I think it boils down to how companies view LLMs and their engineers.

Some companies will do as you say - have (mostly clueless) engineers feed high level "wishes" to (entirely clueless) LLMs, and hope that everyone kind of gets it. And everyone will kind of get it. And everyone will kind of get it wrong.

Other companies will have their engineers explicitly treat the LLMs as collaborators / pair programmers, not independent developers. As an engineer in such a company, YOU are still the author of the code even if you "prompted" it instead of typing it. You can't just "fix this high level thing for me brah" and get away with it, but instead need to continuously interact with the LLM as you define and it implements the detailed wanted behaviors. That forces you to know _exactly_ what you want and ask for _exactly_ what you want without ambiguity, like in any other kind of programming. The difference is that the LLM is a heck of a lot quicker at typing code than you are.

Illniyar 110 days ago

Building a C compiler should not have this problem. There are probably a million test suites coming from outside the LLM that it can use to verify correctness.

harimau777 110 days ago

Honestly, unit tests (at least on the front-end) are largely wasted time in the current state of software development. Taking the time that would have been spent on writing unit tests and instead using it to write functionally pure, immutable code would do much more to prevent bugs.

There's also the problem that when stack rank time comes around each year no one cares about your unit tests. So using AI to write unit tests gives me time to work on things that will actually help me avoid getting arbitrarily fired.

I wish that software engineers were given the time to write both clean code and unit tests, and I wish software engineers weren't arbitrarily judged by out of touch leadership. However, that's not the world we live in so I let AI write my unit tests in order to survive.
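
For readers unsure what "functionally pure, immutable code" buys you, a small sketch (in Python rather than front-end code, with a hypothetical `Cart` type): state cannot be changed in place, and every function returns a new value computed only from its arguments.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Cart:
    items: tuple = ()          # tuple of (name, price_cents) pairs
    discount_percent: int = 0


def add_item(cart, name, price_cents):
    """Pure: returns a new Cart and leaves the input untouched."""
    return replace(cart, items=cart.items + ((name, price_cents),))


def total_cents(cart):
    """Pure: the result depends only on the cart passed in."""
    subtotal = sum(price for _, price in cart.items)
    return subtotal * (100 - cart.discount_percent) // 100


empty = Cart()
one = add_item(empty, "pen", 250)
print(empty.items)       # ()
print(total_cents(one))  # 250
```

Attempting `one.discount_percent = 10` raises an error because the dataclass is frozen, so a whole class of "who changed this?" bugs cannot occur.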

DiscourseFan 110 days ago

You are overvaluing “clean code.” Code is code, it either works within spec or it doesn’t; or, it does but there are errors, more or less catastrophic, waiting to show themselves at any moment. But even in that latter case, no single individual can know for certain, no matter how much work they put in, that their code is perfect. But they can know it’s usable, and someone else can check to make sure it doesn’t blow something else up, and that is the most important thing.

msh 110 days ago

I like unit tests when I have to modify code that someone made years ago, as a basic sanity check.

DeathArrow 110 days ago

>LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code. At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").

You can use spec driven development and TDD. Write the tests first. Write failing code. Modify the code to pass the tests.

salawat 109 days ago

Mwahahahahaha! Suffer, devs, SUFFER! KNOW MY PAIN!

Ah hem... Welcome to the wonderful world of Quality Assurance, software developing audience. That part of the job, after you yeet your code over the fence, where the job is to bridge the gap between your madness, and the madness of the rest of the business. Here you will find: frustration, an ever present sense the rest of the world is just out to make your life more difficult, a creeping sense of despair, a hot ice pick in the back of your mind every time the language model does something syntactically valid, but completely nonsensical in the real world, the development of an ever increasing time horizon over which you can accurately predict the future, but no one will believe you anyway, a smoldering hatred of the overly confident executive with an over developed capacity for risk tolerance; a desire to run away and start a farm, and finally, a fundamental distrust of everything software, and all the people who write it.

Don't forget your complimentary test framework and swag bag on your way out, and remember, you're here forever. You can try to check out, but you can never leave.

mrighele 110 days ago

> The LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code

This is true for humans too. Tests should not be written or performed by the same person that writes the code

nly 110 days ago

That's a complete fantasy world where companies have twice the engineers they actually need instead of half.

missingdays 110 days ago

> [Reviews] should not be written or performed by the same person that writes the code

> That's a complete fantasy world where companies have twice the engineers they actually need instead of half.

harimau777 110 days ago

Agreed, but then companies shouldn't complain about the consequences of understaffing their teams.

rcpffm 108 days ago

Thx. You hit the nail on the head.

bluefirebrand 110 days ago

> I have no constructive recommendations; I feel the industry will keep their foot on the pedal until something catastrophic happens

I can't wait. Maybe when shitty vibe coded software starts to cause real pain for people we can return to some sensible software engineering

I'm not holding my breath though

bentobean 110 days ago

This hits hard. I’m getting hit with so much slop at work that I’ve quietly stopped being all that careful with reviews.

SoftTalker 110 days ago

Um, you're supposed to write the tests first. The agents can't do this?

alexsmirnov 110 days ago

Actually, they are extremely bad at that. All training data contains code + tests, even if the tests were created first. So far, all the models that I tried failed to implement tests for interfaces without access to the actual code.

daliusd 110 days ago

They can, but they should be explicitly told to do that. Otherwise they just do everything in batches. Anyway, pure TDD or not, tests catch only what you tell the AI to write. AI does not know what is right; it does what you told it to do. The above problem wouldn’t be solved by pure TDD.