by godelski 99 days ago

I can't believe we're back to advocating for TDD. It was a failed paradigm the last few times we tried it. This time isn't any different because the fundamental flaw has always been the same: tests aren't proofs, they don't have complete coverage.

Before anyone gets too confused, I love tests. They're great. They help a lot. But to believe they prove correctness is absolutely laughable. Even the most general tests are very narrow. I'm sure they help LLMs just as they help us, but they're not some cure-all. You have to think long and hard about problems and shouldn't let tests drive your development. They're guardrails for checking bounds and reducing footguns.

Oh, who could have guessed, Dijkstra wrote about program completeness. (No, this isn't the foolishness of natural language programming, but it is about formalism ;)

https://www.cs.utexas.edu/~EWD/transcriptions/EWD02xx/EWD288...

6 comments

josephg 99 days ago

Testing works because tests are (essentially) a second, crappy implementation of your software. Tests only pass if both implementations of your software behave the same way. Usually that will only happen if the test and the code are both correct. Imagine if your code (without tests) has a 5% defect rate. And the tests have a 5% defect rate (with 100% test coverage). Then ideally, you will have a 5%^2 defect rate after fixing all the bugs. Which is 0.25%.
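The arithmetic above can be sketched in a few lines. This is only an illustration and it assumes that a defect in the code and a defect in the corresponding test occur independently, which the comment implies but does not state; the function name is made up for the example.

```python
# Assumption: code defects and test defects are independent, so a bug
# survives only when the code is wrong AND the test wrongly accepts it.
CODE_DEFECT_RATE = 0.05
TEST_DEFECT_RATE = 0.05


def residual_defect_rate(code_rate: float, test_rate: float) -> float:
    """Chance that a behavior is wrong in the code and also slips past its test."""
    return code_rate * test_rate


rate = residual_defect_rate(CODE_DEFECT_RATE, TEST_DEFECT_RATE)
print(f"{rate:.2%}")  # 0.25%
```

If the two defect sources are correlated (for example, the same misunderstanding of the spec went into both), the real number would be higher than this product.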

The price you pay for tests is that they need to be written and maintained. Writing and maintaining code is much more expensive than people think.

Or at least it used to be. Writing code with claude code is essentially free. But the defect rate has gone up. This makes TDD a better value proposition than ever.

TDD is also great because claude can fix bugs autonomously when it has a clear failing test case. A few weeks ago I used claude code and experts to write a big conformance test suite of 300+ tests for JMAP. (JMAP is a protocol for email). For fun, I asked claude to implement a simple JMAP-only mail server in rust. Then I ran the test suite against claude's output. Something like 100 of the tests failed. Then I asked claude to fix all the bugs found by the test suite. It took about 45 minutes, but now the conformance test suite fully passes. I didn't need to prompt claude at all during that time. This style of TDD is a very human-time efficient way to work with an LLM.

mewpmewp2 99 days ago

I think there is a difference whether you do TDD or write tests after the fact to avoid regression. TDD can only work decently if you already know your specs very well, but not so much when you still need to figure them out, and need to build something actual to be able to figure it out.

josephg 99 days ago

Yes; I think this remains true with coding agents. If you need to do some exploration of the solution space, it makes sense to do that before writing tests. Once you have a clear, workable design, you can get the agent to make a battery of tests to make sure the final product works correctly.

aray07 98 days ago

This is great. The tests in this case are the spec. When you give the agent something concrete to fail against, it knows what done looks like.

The problem is if you skip that step and ask Claude to write the tests after.

godelski 98 days ago

  > Tests only pass if both implementations of your software behave the same way.

That's not true.

I even addressed this in my comment as did Dijkstra

josephg 98 days ago

What is untrue about this statement you quoted?

godelski 98 days ago

You can have software behave differently while passing the same tests.

Idk man, this is pretty easy to demonstrate. Start with a trivial example: test is that input (2,2) -> 4. Function 1 does multiplication, function 2 does exponentiation. Both functions pass the test.
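A minimal sketch of that example, with function names chosen only for illustration: both implementations satisfy the single test on (2, 2), yet they disagree on other inputs.

```python
def multiply(a: int, b: int) -> int:
    """Function 1: multiplication."""
    return a * b


def power(a: int, b: int) -> int:
    """Function 2: exponentiation."""
    return a ** b


def passes_test(f) -> bool:
    """The only test in the suite: input (2, 2) must give 4."""
    return f(2, 2) == 4


print(passes_test(multiply), passes_test(power))  # True True
print(multiply(2, 3), power(2, 3))  # 6 8
```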

Sure, simple example but illustrative examples should be simple. But add more complexity and I'll add more examples of functions where the outputs are the same for a given set of inputs. (There's a whole area of mathematics dedicated to this!) It's simple, but you also confidently claimed something that was trivial to disprove.

Your claim is true if and only if your tests have complete coverage. So, your claim is only true if you've done formal verification of your code. Which was what I said in the beginning and is what Dijkstra claimed as well.

josephg 98 days ago

I mean, yeah, I thought that was obvious. If you want to be a pedant:

> Tests only pass if both implementations of your software behave the same way in the exact area being tested.

As I said in my comment above. Tests are a crappy second implementation. The test in your example isn’t even defined outside the input range of (2,2). Tests are a stochastic tool. Tests can prove the presence of bugs, not their absence. Completeness isn’t something tests alone can provide. But in the choice between yolo coding and yolo coding plus tests, you’re obviously going to get fewer bugs with tests.

theshrike79 99 days ago

When you write tests with LLM-generated code you're not trying to prove correctness in a mathematically sound way.

I think of it more as "locking" the behavior to whatever it currently is.

Either you do the red-green-with-multiple-adversarial-sub-agents thing, or you just do the feature, poke the feature manually, and if it looks good you have the LLM write tests that confirm it keeps doing what it's supposed to do.

The #1 reason TDD failed is because writing tests is BOORIIIING. It's a bunch of repetition with slight variations of input parameters, a ton of boilerplate or helper functions that cover 80% of the cases, but the last 20% is even harder because you need to get around said helpers. Eventually everyone starts copy-pasting crap and then you get more mistakes into the tests.

LLMs will write 20 test cases with zero complaints in two minutes. Of course they're not perfect, but human made bulk tests rarely are either.

godelski 98 days ago

  > you're not trying to prove correctness in a mathematically sound way.

  > "locking" the behavior to whatever it currently is.

These two sentences are incompatible

  > The #1 reason TDD failed is

Because spec is an ever evolving thing that cannot be determined a priori. And because it highly incentivized engineers to metric hack.

  > It's a bunch of repetition with slight variations

If that's how you're writing tests then you're writing them wrong. You have the wrong level of abstraction. Abstraction is not a dirty word. It solves these problems. Maybe juniors don't understand that abstraction and fuck it up while learning but making abstraction a dirty word is throwing the baby out with the bath water.
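One common way to raise the level of abstraction is a table-driven test, where the varying inputs live in data and the checking logic is written once. The sketch below is only an illustration; `clamp` and the case table are hypothetical and not from the thread.

```python
def clamp(value: int, low: int, high: int) -> int:
    """Hypothetical function under test."""
    return max(low, min(value, high))


# Each row is one test case: (value, low, high, expected).
CASES = [
    (5, 0, 10, 5),    # inside the range
    (-3, 0, 10, 0),   # below the range
    (42, 0, 10, 10),  # above the range
    (0, 0, 0, 0),     # degenerate range
]


def run_cases() -> list:
    """Run every row through the same check and return the rows that failed."""
    failures = []
    for value, low, high, expected in CASES:
        actual = clamp(value, low, high)
        if actual != expected:
            failures.append((value, low, high, expected, actual))
    return failures


print(run_cases())  # []
```

Adding a new variation is then one more row in `CASES` rather than another copy-pasted test function.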

  > Eventually everyone starts copy-pasting crap

Which is a horrendous way to write code.

theshrike79 98 days ago

Locking behavior with tests isn't the same as comprehensive and foolproof tests. They might not cover every edge case, but will fail if the happy path starts failing for some reason.
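A minimal sketch of what "locking" can look like, sometimes called a characterization or golden test. The function `format_invoice` and the recorded snapshot are hypothetical; the point is that the test pins whatever the code currently does on a happy path, without claiming that output is correct.

```python
import json


def format_invoice(items):
    """Hypothetical existing function whose current behavior we want to pin."""
    total = sum(qty * price for _, qty, price in items)
    return {"lines": len(items), "total": round(total, 2)}


# Snapshot recorded once from the current implementation (an assumption of
# this sketch). It documents behavior, not correctness.
GOLDEN = '{"lines": 2, "total": 12.5}'


def test_format_invoice_is_locked():
    items = [("widget", 2, 5.0), ("gadget", 1, 2.5)]
    actual = json.dumps(format_invoice(items), sort_keys=True)
    assert actual == GOLDEN  # fails as soon as the happy path changes


test_format_invoice_is_locked()
```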

And yes, copy-pasting is a horrendous way to write code, but everyone does it.

When you're adding the 1600th CRUD endpoint of your career to an enterprise Java/C# application, can you with all honesty say you will type every single character with the same thought and consideration every time?

Or do you just make one, copy-paste that one and modify accordingly?

Or if you write 20 unit tests with slight alterations you masterfully craft every single character to perfection?

I have a limited amount of energy to use every day, I choose to use it in places that matter. The hard bits that LLMs and copy-pasting can't speed up.

computerdork 99 days ago

Hmm, not so sure TDD is a failed paradigm. Maybe it isn't a panacea, but it seems like it's changed how software development is done.

Especially for backend software and also for tools, seems like automated tests can cover quite a lot of use cases a system encounters. Their coverage can become so good that they'll allow you to make major changes to the system, and as long as they pass the automated tests, you can feel relatively confident the system will work in prod (have seen this many times).

But maybe you're separating automated testing and TDD as two separate concepts?

prerok 99 days ago

Indeed, they are two separate concepts.

I write lots of automated tests, but almost always after the development is finished. The only exception is when reproducing a bug, where I first write the test that reproduces it, then I fix the code.

TDD is about developing tests first then writing the code to make the tests pass. I know several people who gave it an honest try but gave up a few months later. They do advocate everyone should try the approach, though, simply because it will make you write production code that's easier to test later on.

computerdork 98 days ago

... hmm, just looked it up. According to some sites on the web, TDD was created by Kent Beck as a part of Extreme Programming in the '90s and automated testing is a big part of TDD. Having lived through that era, thinking back, would say that TDD did help to popularize automated testing. It made us realize that focusing a ton on writing tests had a lot of benefits (and yeah, most of us didn't do the test first development part).

But this is kind of splitting hairs on what TDD is, not too important.

mewpmewp2 99 days ago

I think tests in general are good, just not TDD, as it forces you into what I think is a bad and narrow paradigm of thinking. I think e.g. it is better that I build the thing, then get to 90%+ coverage once I am sure this is what I would also ship.

godelski 98 days ago

That's the result I've seen with anyone who tries TDD. Their code ends up being very rigid, making it difficult to add new features and fix bugs. It just ends up making them over confident in their code's correctness. As if their code is bug free. It just seems like an excuse to not think and avoid doing the hard stuff.

godelski 98 days ago

  > But maybe you're separating automated testing and TDD as two separate concepts?

I hope it's clear that I am, given my comment and how I stress that I write tests. The existence of tests does not make development TDD.

The first D in TDD stands for "driven". While my sibling comment explains the traditional paradigm, it can also be seen in an iterative sense, like developing a new feature or even fixing a bug. You start with developing a test, treating it like spec, and then write code to that spec. Look at many of your sibling comments and you'll see that they follow this framing. Think about it carefully and adversarially. Can you figure out its failure mode? Everything has a failure mode, so it's important to know.

Having tests doesn't mean they drive the development. So there are many ways to develop software that aren't TDD but have tests. The important part is to not treat tests as proofs or spec. They are a measurement like any other; a hint. They can't prove correctness (that your code does what you intend it to do). They can't prove that it is bug free. But they hint at those things. Those things won't happen unless we formalize the code, and not only is that costly in time but it will often result in unacceptable computational overhead.

I'll give an example of why TDD is so bad. I taught a class a year ago (upper div Uni students) and gave them some skeleton code, a spec sheet, and some unit tests. I explicitly told them that the tests are similar to my private tests, which will be used to grade them, but that they should not rely on them for correctness and I encourage them to write their own. The next few months my office hours were filled with "but my code passes the tests" and me walking students through the tests and discussing their limitations along with the instructions. You'd be amazed at how often the same conversations happened with the same students over and over. A large portion of the class did this. Some just assumed tests had complete coverage and never questioned them while others read the tests and couldn't figure out their limits. But you know the students who never struggled in this way? The students who first approached the problem through design and even understood that even the spec sheet is a guide. That it tells requirements, not completeness. Since the homeworks built on one another those students had the easiest time. Some struggled at first, but many of them got the right levels of abstraction that I know I could throw new features at them and they could integrate without much hassle. They knew the spec wasn't complete. I mean of course it wasn't, we told them from the get go that their homeworks were increments to building a much larger program. And the only difference between that and real world programming is that that isn't always explicitly told to you and that the end goal is less clear. Which only makes this design style more important.

The only thing that should drive the software development is an unobtainable ideal (or literal correctness). A utopia. This reduces metric hacking, as there is no metric to hack. It helps keep you flexible as you are unable to fool yourself into believing the code is bug free or "correct". Your code is either "good enough" or not. There's no "it's perfect" or "is correct", there's only triage. So I'll ask you even here, can you find the failure mode? Why is that question so important to this way of thinking?

computerdork 97 days ago

Hmm, saying tests are just a hint seems to be underappreciating their significance. Yes, they do have bugs of their own, but as you said they are a measurement. Having them statistically reduces the chances of bugs reaching production. They don't remove them completely of course, but they do greatly decrease the rate of bugs (and I have read the same thing: formal verification of the code is typically not worth the time and cost).

And just looked up TDD on Wikipedia. Actually, the standard process is not to write all the tests first, then do the implementation. It's to do what a lot of devs already do: write some tests based on your requirements. Then, write the implementation for these tests. Then repeat, adding in more tests for other paths through the system.
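A rough sketch of two such iterations, shown as the code looks after both. The function `cart_total` and its discount rule are hypothetical and exist only to illustrate the loop: each test was written before the line it exercises.

```python
def cart_total(prices, discount_threshold=100, discount_percent=10):
    """Hypothetical function grown over two test-then-implement iterations."""
    total = sum(prices)                 # added to satisfy the iteration 1 test
    if total > discount_threshold:      # added to satisfy the iteration 2 test
        total -= total * discount_percent // 100
    return total


def test_sums_prices():
    # Iteration 1: written first; failed until cart_total summed the prices.
    assert cart_total([10, 20]) == 30


def test_discount_over_threshold():
    # Iteration 2: written next; failed until the discount branch existed.
    assert cart_total([100, 100]) == 180


test_sums_prices()
test_discount_over_threshold()
```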

Didn't know this myself about TDD (I thought it was to focus on writing all the tests, then do the implementation). Yeah, TDD is actually a very practical approach and something I pretty much do in my own development. Instead of using a driver program to run your working code, just write unit tests to run it. And keep building your unit tests for every new feature or execution path you're working on. You'll miss a lot of them early on, but you fill out the rest at the end.

Now that I know, in my opinion, TDD was pretty amazing and changed our industry.

siva7 98 days ago

TDD and similar test paradigms all have the same fundamental flaw: it's testing for the sake of testing. You need to know exactly what you want in order to start, which isn't compatible with a competitive iterative workflow no matter how much TDD yells otherwise. TDD doesn't make sense in agile and fast iteration workflows, only in heavily regulated / restricted products.

tinodb 98 days ago

It certainly isn’t. It is more a way of discovery on how to implement something, with the benefit of being able to safely (and thus easily) change it later.

The 99 Bottles book by Sandi Metz [0] is a good short display of how it works and where it actually helps in building maintainable software.

[0] https://sandimetz.com/99bottles

mvdtnz 99 days ago

> But to believe they prove correctness is absolutely laughable.

You don't need to believe this to practice TDD. In fact I challenge you to find one single mainstream TDD advocate who believes this.

godelski 98 days ago

https://news.ycombinator.com/item?id=47333160

skeledrew 99 days ago

> But to believe they prove correctness is absolutely laughable.

Sounds like a lack of tests for the correct things.

godelski 98 days ago

True, but I seriously doubt people are writing formal proofs for their code. I've only seen this in niche academic circles and high security/safety settings. I also am pretty certain it's not what you're suggesting, but hey, I could be wrong