Hacker News new | ask | show | jobs
by hombre_fatal 17 days ago
Testing is an even more powerful subject here since we barely do it.

Testing is so hard that we'll agree that, e.g., TDD is great (e.g. ensure your tests actually test something, ensure your code is testable from the start) yet we never do it. And when we do write tests, we are on the hook to be eternally vigilant that they are not stale, that they test something real, that they are not redundant. And they often turn into an append-only file that you resent.

Meanwhile, AI is happy to write tests, do red-green TDD cycles, refactor them, prune them, update them, justify and defend them. It will even incidentally write tests for the most aloof vibe-coder by accident because they didn't specify otherwise.

Overnight, I went from never testing most of my side projects (except for, say, maybe unit tests in more straightforward things like a parser) to now everything is tested end-to-end. Every time I make a new directional / architectural decision, the tests the AI writes also encode it at the test level to reenforce the decision.

It's strictly a better world for software because AI can write and maintain tests.

> LLM-assisted codebases, at least these days, only stick together thanks to testing

But tests also help humans and ensure human-written software is robust. We only don't test because they are so costly to write and maintain, and our software has always suffered for it. Or the tests become such an unmaintainable mess that our software is now worse because of it!

3 comments

I've gotten a lot of value out of LLM tools but without extensive feedback and direction I've found even the newer version of Opus pretty bad at writing good tests. First drafts are full of tests with some of these characteristics:

- good test, wrong layer, would turn into a mess of "wait why'd tests over there blow up for changes over here"

- mostly-good test, subtle issue (yes, this is the status quo with most human-written tests, but the risk of not being careful is that you (or the agent you throw at crap) now is overconfident in your future changes)

- weird silly test like "assert that calling the six statements in this method in the right order do the right thing" without... actually calling the method itself... so don't protect against changes to the method?

For non-greenfield work I've recently been much more happy with their out of the box code change quality then their attempts at adding coverage for those changes!

I think this is an area that has remained hard because putting directives in CLAUDE.md or whatnot for tests is generally gonna be so generic to be useless, like "put tests in the right place" without more module-specific context. Whereas if I'm making a non-greenfield change, I'm thinking much more in my prompt about constraints on the code itself, and much less about the current shape/state/organization of the tests or what to direct it on.

Properly used it's great. Definitely improved my test coverage a lot.

But it's entirely still in the world of "people who'd care to write good code before will write good code faster; other people will just write mediocre code faster."

> Meanwhile, AI is happy to write tests, do red-green TDD cycles, refactor them, prune them, update them, justify and defend them. It will even incidentally write tests for the most aloof vibe-coder by accident because they didn't specify otherwise.

I read some AI generated tests and while it looks visually impressive, ultimately it wasn’t doing anything valuable? Why? because of all the mocks and scenarios that didn’t matter. And on top of that, tests are additional code to maintain.

These days, I don’t even bother with unit testing. They are a maintenance burden. I focus on integration test (whole modules) and if I have the time, on a harness to do e2e testing.

Many humans do the same thing. They write tests that mock so much of the actual code, the "test" is not testing much at all, except perhaps the developer's ability to basically turn the code inside out. It's often just a large volume of crap that has to be maintained (or eventually deleted.)

I agree integration tests are best, with some e2e testing for common scenarios.

I worked at a place that required unit tests for every new method or function. Arguments like "this other (integration) test already covers that. why do I have to add another test?" wouldn't fly. PR reviews would often degrade into arguments about testing and how all database access needed to be mocked...

> I read some AI generated tests and while it looks visually impressive, ultimately it wasn’t doing anything valuable

I just saw this comment yesterday about one of the tests from Bun’s rust rewrite: https://news.ycombinator.com/item?id=48314311 It reads in the raw source code and uses a regex to assert that “unsafe” is used.

> These days, I don’t even bother with unit testing. They are a maintenance burden

I’ve come to the same conclusion, but that’s only because I’m working on solo projects. I think they are probably worth it with multiple devs on the same project.

IMO with multiple devs the focus on integration (module-level) instead of unit (lower-level/helper-function-level) is even more important than on single-dev projects.
My instructions and steering is to force the LLM agent to focus on mocking only system boundaries and outside in unit testing. I've found these two make verbose but good tests that actually prove behavior pretty well.

But that was my testing strategy already, but writing one of those tests could take legitimately hours with how many columns and crazy rules our area has.

a11y testing is non-trivial. axe-core can automatically detect many types of issues. However, enough compliance (to avoid being sued) needs end-to-end testing and human judgement. e.g. keyboard traps, focus restoration, alt-text, etc.