Hacker News
by satisfice 604 days ago
Like nearly all the articles about AI doing "testing" or any other skilled activity, the last part of it admits that it is an unreliable method. What I don't see in this article-- which I suspect is because they haven't done any-- is any description of a competent and reasonably complete testing process of this method of writing "tests." What they probably did is to try this, feel good about it (because testing is not their passion, so they are easily impressed), and then mark it off in their minds as a solved problem.

The retort by AI fanboys is always "humans are unreliable, too." Yes, they are. But they have other important qualities: accountability, humility, legibility, and the ability to learn experientially as well as conceptually.

LLMs are good at instantiating typical or normal patterns (based on their training data). Skilled testing cannot be limited to typicality, although that's a start. What I'd say is that this is an interesting idea that has an important hazard associated with it: complacency on the part of the developer who uses this method, which turns things that COULD be missed by a skilled tester into things that are GUARANTEED to be missed.

3 comments

Author here: Yes, there are certain functions where writing good tests will be difficult for an LLM, but in my experience I've found that the majority of functions that I write don't need anything out of the ordinary and are relatively straightforward.

Using LLMs allows us to have much higher coverage than if we didn't use them. To me and our engineering team, this is a pretty good thing because in the time prioritization matrix, if I can get a higher quality code base with higher test coverage with minimal extra work, I will definitely take it (and in fact it's something I encourage our engineering teams to do).

Most of the base tests that we use were created originally by some of our best engineers. The patterns they developed are used throughout our code base and LLMs can take these and make our code very consistent, which I also view as a plus.

re: Complacency: We actually haven't found this to be the case. In fact, we've seen more tests being written with this method. Just think about how much easier it is to review a PR and make edits vs write a PR. You can actually spend your time enforcing higher quality tests because you don't have to do most of the boilerplate for writing a test.

Have you systematically tested this approach? It sounds like you are reporting on your good vibes. Your writing is strictly anecdotal.

I’ve been working with AI, too. I see what I’m guessing is the same unreliability that you admit in the last part of your article. For some reason, you are sanguine about it, whereas I see it as a serious problem.

You say you aren’t complacent, but your words don’t seem to address the complacency issue. “More tests” does not mean better testing, or even good enough testing.

Google “automation bias” and tell me what policies, procedures, or training are in place to avoid it.

I would say that the complacency part is identifying good tests with good coverage. I agree that writing tests is one of the best use cases for LLMs, and it definitely saves engineers a lot of time. But if you follow them too blindly, it is easy to get carried away by how easy it is to write tests that chase coverage instead of actually testing the things that matter for quality. Which is what the previous comment was pointing at:

> which turns things that COULD be missed by a skilled tester into things that are GUARANTEED to be missed.
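To make the coverage-versus-quality point concrete, here is a minimal sketch. The functions `apply_discount` and `buggy_discount` are hypothetical and not from the article; the idea is only that a test can execute every line without checking any result, so it passes on a broken implementation too.

```python
from typing import Callable

Discount = Callable[[float, float], float]


def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under test: reduce price by a percentage."""
    if percent < 0 or percent > 100:
        raise ValueError("percent must be between 0 and 100")
    return price - price * percent / 100


def buggy_discount(price: float, percent: float) -> float:
    """Same shape, but returns the discount amount instead of the new price."""
    if percent < 0 or percent > 100:
        raise ValueError("percent must be between 0 and 100")
    return price * percent / 100  # bug


def coverage_only_test(fn: Discount) -> bool:
    # Executes both branches, so line coverage is 100%,
    # but nothing about the return value is checked.
    fn(100.0, 10.0)
    try:
        fn(100.0, 150.0)
    except ValueError:
        pass
    return True


def behaviour_test(fn: Discount) -> bool:
    # Covers the same lines, but pins down results and the error case.
    try:
        fn(100.0, 150.0)
        return False  # out-of-range percent must be rejected
    except ValueError:
        pass
    return fn(100.0, 10.0) == 90.0 and fn(100.0, 0.0) == 100.0
```

Both tests report full line coverage for either implementation, but only `behaviour_test` fails on `buggy_discount`. Coverage numbers alone cannot tell these two tests apart, which is why a reviewer still has to read what is being asserted.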

I do use LLMs to bootstrap my unit testing (because there is a lot of boilerplate in unit tests and mocks), but I tend to finish the unit tests myself. This gives me confidence that my tests are accurate to the best of my knowledge.

Having good tests allows me to be more liberal with LLMs on implementation. I still only use LLMs to bootstrap the implementation, and I finish it myself. LLMs, being generative, are really good for ideating different implementations (it proposes implementations that I would never have thought of), but I never take any implementation as-is -- I always try to step through it and finish it off manually.

Some might argue that it'd be faster if I wrote the entire thing myself, but it depends on the problem domain. So much of what I do involves implementing code for unsolved problems (I'm not writing CRUD apps, for instance) that I really do get a speed-up from LLMs.

I imagine folks writing conventional code might spend more time fixing LLM mistakes and thus think that LLMs slow them down. But this is not true for my problem domain.

The answer to this is code review. If an LLM writes code for you - be it implementation or tests - you review it before you land it.

If you don't understand how the code works, don't approve it.

Sure, complacent developers will get burned. They'll find plenty of other non-AI ways to burn themselves too.

100% agree. We don't expect human developers to be perfect, why should we expect AI assistants. Code going to production should go through review.

I do think that LLMs will increase the volume of bad code though. I use Cursor a lot, and occasionally it will produce perfect code, but often I need to direct and refine, and sometimes throw away. But I'm sure many devs will get lazy and just push once they've got the thing working...

> 100% agree. We don't expect human developers to be perfect, why should we expect AI assistants.

I think the issue is that we are currently being sold that it is. I'm blown away by how useful AI is, and how stupid it can be at the same time. Take a look at the following example:

https://app.gitsense.com/?doc=f7419bfb27c896&highlight=&othe...

If you click on the sentence, you can see how dumb Sonnet-3.5 and GPT-4 can be. Each model was asked to spell-check and grammar-check the sentence 5 times each, and you can see that GPT-4o-mini was the only one that got this right all 5 times. The other models mostly got it comically wrong.

I believe LLMs are going to change things for the better for developers, but we need to properly set expectations. I suspect this will be difficult, since a lot of VC money is being pumped into AI.

I also think a lot of mistakes can be prevented if your prompt asks the model to explain how and why it did what it did. For example, the prompt that was used in the blog post should include "After writing the test, summarize how each rule was applied."
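A rough sketch of what that could look like in practice. Only the final instruction sentence comes from the comment above; the function name `build_test_prompt`, the prompt wording, and the example rules are placeholders, since the blog post's actual prompt is not reproduced here.

```python
# The one sentence taken from the comment above.
SUMMARY_INSTRUCTION = "After writing the test, summarize how each rule was applied."


def build_test_prompt(source_code: str, rules: list[str]) -> str:
    """Assemble a test-writing prompt that ends with a request to explain itself.

    The surrounding wording is illustrative only.
    """
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    return (
        "Write a unit test for the following function.\n\n"
        f"Rules:\n{numbered}\n\n"
        f"Function:\n{source_code}\n\n"
        f"{SUMMARY_INSTRUCTION}"
    )


# Example usage with made-up rules.
example_prompt = build_test_prompt(
    "def add(a, b):\n    return a + b",
    ["Use pytest.", "Cover at least one edge case."],
)
```

The summary the model returns is not proof that the rules were followed, but it gives a reviewer something specific to check the generated test against.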

"I think the issue is that we are currently being sold that it is."

The message that these systems are flawed appears to be pretty universal to me:

ChatGPT footer: "ChatGPT can make mistakes. Check important info."

Claude footer: "Claude can make mistakes. Please double-check responses."

https://www.meta.ai/ "Messages are generated by AI and may be inaccurate or inappropriate."

etc etc etc.

I still think the problem here is science fiction. We have decades of sci-fi telling us that AI systems never make mistakes, but instead will cause harm by following their rules too closely (paperclip factories, 2001: A Space Odyssey etc).

Turns out the actual AI systems we have make mistakes all the time.

But on the other other hand, there's the commercials generated to sell new models or new model features, that FREQUENTLY lie about actual capabilities and fake demos and don't actually end with an equivalent amount of time going over how actual usage may be shit and completely unlike the advertisement.

I'd say parent is absolutely correct - we ARE being sold (quite literally, through promotional material, i.e. ads) that these models are way more capable than they actually are.

You do have to admit, the footer is extremely small and it's also not in the most prominent place. I think most "AI companies" probably don't go into a sales pitch saying "It's awesome, but it might be full of shit".

I do see your science fiction angle, but I think the bigger issue is the media, VCs, etc. are not clearly spelling out that we are nowhere near science fiction AI.

I appreciate the footer on Kagi Assistant: "Assistant can make mistakes. Think for yourself when using it" - a reminder that there's a tendency to outsource your own train of thought.

I would have to imagine 90+ percent of people use LLMs and AI to outsource their thought and most will not heed this warning. OpenAI might say "Check important info." but they know most people probably won't do a Google search or visit their library to fact-check things.

> We don't expect human developers to be perfect, why should we expect AI assistants.

What absolute nonsense. What an absurd false equivalence. It's not that we expect perfection or even human level performance from "AI". It's that the crap that comes out of LLMs is not even at the level of a first year student. I've never in my entire life reviewed the code of a junior engineer and seen them invent third party APIs from whole cloth. I've never had a junior send me code that generates a payload that doesn't validate at the first layer of the operation with zero manual testing to check it. No junior has ever asked me to review a pull request containing references to an open source framework that doesn't exist anywhere in my application. Yet these scenarios are commonplace in "AI" generated code.

That problem genuinely doesn't matter to me at all.

If an LLM hallucinates a method that doesn't exist I find out the moment I try and run the code.

If I'm using ChatGPT Code Interpreter (for Python) or Claude analysis mode (for JavaScript) I don't even have to intervene: the LLM can run in a loop, generating code, testing that it executes without errors and correcting any mistakes it makes.

I still need to carefully review the code, but the mistakes which cause it not to run at all are by far the least amount of work to identify.
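The generate, run, and correct loop described above can be sketched roughly as follows. This is not how ChatGPT Code Interpreter or Claude's analysis mode are implemented; `run_until_it_executes` and `fake_model` are hypothetical names, and the "model" here is a stub that hallucinates a function on its first attempt. Running generated code with `exec` is only reasonable inside a sandbox.

```python
import traceback
from typing import Callable, Optional


def run_until_it_executes(
    generate: Callable[[Optional[str]], str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask `generate` for code, run it, and feed any traceback back in.

    Returns the first code string that runs without raising, or None.
    WARNING: exec() on untrusted code is unsafe outside a sandbox.
    """
    error: Optional[str] = None
    for _ in range(max_attempts):
        code = generate(error)
        try:
            exec(compile(code, "<generated>", "exec"), {})
        except Exception:
            # Syntax errors, missing attributes, etc. all land here.
            error = traceback.format_exc()
            continue
        return code
    return None


def fake_model(error: Optional[str]) -> str:
    """Stand-in for an LLM call: hallucinates first, corrects when shown an error."""
    if error is None:
        return "import math\nprint(math.squareroot(16))"  # no such function
    return "import math\nprint(math.sqrt(16))"


working_code = run_until_it_executes(fake_model)
```

A loop like this only filters out code that fails to run at all. Code that runs but does the wrong thing passes straight through, so it does not replace the careful review mentioned above.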

Yes I've seen the dreck you produce with LLMs. Not a shining endorsement in my eyes.

https://news.ycombinator.com/item?id=41929174

Which of those did you think were dreck?

I think the source code for tools like this one is genuinely good code: https://github.com/simonw/tools/blob/main/extract-urls.html

What do you see that's wrong with that?