Hacker News
by eunoia 142 days ago
This is real. I’ve seen some baffling bugs in prompt-based stop hook behavior.

When I investigated I found the docs and implementation are completely out of sync, but the implementation doesn’t work anyway. Then I went poking on GitHub and found a vibed fix diff that changed the behavior in a totally new direction (it did not update the documentation).

Seems like everyone over there is vibing and no one is rationalizing the whole.

6 comments

I’m happy to throw an LLM at our projects but we also spend time refactoring and reviewing each other’s code. When I look at the AI-generated code I can visualize the direction it’s headed in—lots of copy-pasted code with tedious manual checks for specific error conditions and little thought about how somebody reading it could be confident that the code is correct.

I can’t understand how people would run agents 24/7. The agent is producing mediocre code and is bottlenecked on my review & fixes. I think I’m only marginally faster than I was without LLMs.

> with tedious manual checks for specific error conditions

And specifically: Lots of checks for impossible error conditions - often then supplying an incorrect "default value" in the case of those error conditions which would result in completely wrong behavior that would be really hard to debug if a future change ever makes those branches actually reachable.

I always thought that for the vast majority of your codebase, the right thing to do with an error is to propagate it. Either blindly, or by wrapping it with a bit of context info.
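
To make the contrast concrete, here is a minimal Python sketch. `load_port`, `load_port_swallowing`, and `ConfigError` are hypothetical names invented for illustration, not taken from any project mentioned in this thread. The first version swallows the error and substitutes a default; the second propagates it with a bit of context.

```python
class ConfigError(Exception):
    """Hypothetical error type that carries context about what failed."""


def load_port_swallowing(raw: str) -> int:
    # Anti-pattern: hide the failure behind a made-up default.
    try:
        return int(raw)
    except ValueError:
        return 8080  # the caller can no longer tell that the input was bad


def load_port(raw: str) -> int:
    # Propagate instead: wrap the original error with context and re-raise.
    try:
        return int(raw)
    except ValueError as err:
        raise ConfigError(f"invalid port value {raw!r}") from err
```

With `raise ... from err` the original `ValueError` stays attached as the cause, so the traceback shows both the context and the underlying failure.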

I don’t know where the LLMs are picking up this paranoid tendency to handle every single error case. It’s worth knowing about the error cases, but it requires a lot more knowledge and reasoning about the current state of the program to think about how they should be handled. Not something you can figure out just by looking at a snippet.

Training data from junior programmers or introductory programming teaching material. No matter how carefully one labels data, the combination of programming’s subjectivity (damaging human labeling and reinforcement’s effectiveness at filtering around this) and the sheer volume of low-experience code in the input corpus makes this condition basically inevitable.
Garbage in, garbage out, as they say. I will be the first to admit that Claude enables me to do certain things that I simply could not do before without investing a significant amount of time and energy.

At the same time, the number of anti-patterns the LLM generates is higher than I am able to manage. No, Claude.md and Skills.md have not fixed the issue.

Building a production-grade system using Claude has been a fool's errand for me. Whatever time/energy I save by not writing code, I end up paying back when I read code that I did not write and fix anti-patterns left and right.

I rationalized a bit - deflecting by saying this is AI's code, not mine. But no - this is my code and it's bad.

> At the same time, the number of anti-patterns the LLM generates is higher than I am able to manage. No, Claude.md and Skills.md have not fixed the issue.

This is starting to drive me insane. I was working on a Rust CLI that depends on Docker and Opus decided to just… keep the CLI going with a warning “Docker is not installed” before jumping into a pile of garbage code that looks like it was written by a lobotomized kangaroo because it tries to use an Option<Docker> everywhere instead of making sure it’s installed and quitting with an error if it isn’t.

What do I even write in a CLAUDE.md file? The behavior is so stupid I don’t even know how to prompt against it.
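
For what it's worth, the behavior being asked for is small enough to sketch. Below is a hedged Python version (not the Rust code from the comment above): check once at startup, exit with an error if Docker is missing, and hand the rest of the program a value that is never optional. `require_docker` is a hypothetical helper; `shutil.which` is the standard-library lookup, passed in as a parameter here only so the check is easy to test.

```python
import shutil
import sys
from typing import Callable, Optional


def require_docker(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Return the path to the docker executable, or exit with an error."""
    path = which("docker")
    if path is None:
        # Fail fast: nothing downstream has to handle "maybe no Docker".
        sys.exit("error: Docker is not installed or not on PATH")
    return path
```

Called once at the top of `main()`, this removes the need for an optional Docker handle anywhere else in the program.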

> I don’t know where the LLMs are picking up this paranoid tendency to handle every single error case.

Think about it, they have to work in a very limited context window. Like, just the immediate file where the change is taking place, essentially. Having broader knowledge of how the application deals with particular errors (catch them here and wrap? Let them bubble up? Catch and log but don't bubble up?) is outside its purview.

I can hear it now, "well just codify those rules in CLAUDE.md." Yeah but there's always edge cases to the edge cases and you're using English, with all the drawbacks that entails.

I have encoded rules against this in CLAUDE.md. Claude routinely ignores those rules until I ask "how can this branch be reached?" and it responds "it can't. So according to <rule> I should crash instead" and goes and does that.
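
A small Python sketch of what "crash instead" means here, assuming a hypothetical `Status` enum and `label` function made up for the example. The final branch cannot be reached today; if a future change makes it reachable, it fails loudly instead of returning a plausible-looking default.

```python
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    DISABLED = "disabled"


def label(status: Status) -> str:
    if status is Status.ACTIVE:
        return "Active"
    if status is Status.DISABLED:
        return "Disabled"
    # Unreachable with the current enum. Crash rather than invent a default,
    # so a newly added member is noticed immediately.
    raise AssertionError(f"unhandled status: {status!r}")
```
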
The answer (as usual) is reinforcement learning. They gave ten idiots some code snippets, and all of them went for the "belt and braces" approach. So now that's all we get, ever. It's like the previous versions that spammed emojis everywhere despite that not being a thing whatsoever in their training data. I don't think they ever fixed that, just put a "spare us the emojis" instruction in the system prompt as a band-aid.
This is my biggest frustration with the code they generate (but it does make it easy to check if my students have even looked at the generated code). I don't want to fail silently or hard-code an error message; it creates a pile of lies to work through for future debugging.
Bad tests and bad error handling have been Claude's worst-performing areas for me.

In particular writing tests that do nothing, writing tests and then skipping them to resolve test failures, and everybody's favorite: writing a test that greps the source code for a string (which is just insane, how did it get this idea?)
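
For anyone who has not run into the first of these, here is a hedged Python sketch of the difference. `slugify` is a hypothetical function made up for the example. The first test passes for almost any implementation, so it verifies nothing; the second checks the actual result.

```python
def slugify(title: str) -> str:
    """Hypothetical function under test."""
    return "-".join(title.lower().split())


def test_does_nothing() -> None:
    # Anti-pattern: this passes as long as the call returns anything at all.
    result = slugify("Hello  World")
    assert result is not None


def test_behavior() -> None:
    # Exercises the function and pins down the expected output.
    assert slugify("Hello  World") == "hello-world"
```
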

Seriously. Maybe 60% of the time I use Claude for tests, the "fix" for the failing tests is also to change the application code so the test passes (in some cases it will want to make massive architecture changes to accommodate the test, even if there's an easy way to adapt the test to better fit the arch). Maybe half the time that's the right thing to do, but the other half the time it is most definitely not. It's a high enough error rate that it is only borderline useful.
Usually you want to fix the code that's failing a test.

The assumption is that your test is right. That's TDD. Then you write your code to conform to the tests. Otherwise what's the point of the tests if you're just trying to rewrite them until they pass?

Or deleting the test files to make all tests pass. It’s my personal favorite.
>Seems like everyone over there is vibing and no one is rationalizing the whole.

The Claude Code creator literally brags about running 10 agents in parallel 24/7. It doesn't just seem like it; they confirmed it as if it were the most positive thing ever.

It's software engineering crack. Starting a project feels amazing, features are shipping, a complex feature in the afternoon - ezpz. But AI lacks permanence: for every feature you start over from scratch, except there is more codebase now, and the context window is still the same. So there is drift, the codebase randomizes, edge cases proliferate, and the implementation velocity slows down.

Full disclosure - I am a heavy codex user and I review and understand every line of code. I manually fight spurious tests it tries to add by pointing out that a similar one already exists and we can get coverage with +1 LOC vs +50. It's exhausting, but personal productivity is still way up.

I think the future is bright because training / fine-tuning taste, dialing down agentic frameworks, introducing adversarial agents, and increasing model context windows all seem attainable and stackable.

I usually have multiple agents up working on a codebase. But it's typically 1 agent building out features and 1 or 2 agents code reviewing, finding code smells, bad architecture, duplicated code, stale/dead code, etc.

I'm definitely faster, but there's a lot of LLM overhead to get things done right. I think if you're just using a single agent/session you're missing out on some of the speed gains.

I think a lot of the gains I get using an LLM come from having multiple agent sessions work on different projects at the same time.

I think that the current test suite is far too small. For the Claude Code codebase, a sensible next step would be to generate thousands of tests. Without that kind of coverage, regressions are likely, and the existing checks and review process do not appear sufficient to reliably prevent them. My request is that an entirely LLM-written feature should only be eligible for merge once all of those generated tests pass, so we have objective evidence that the change preserves existing behavior.
I know at least one of the companies behind a coding agent we all have heard of has called in human experts to clean up their vibe coded IAC mess created in the last year.
I switched to OpenCode, away from Claude-Code, because Claude-Code is _so_ buggy.
> When I investigated I found the docs and implementation are completely out of sync, but the implementation doesn’t work anyway.

That is not an uncommon occurrence in human-written code as well :-\

Someone said it best after one of those AWS outages from a fat-fingered config change:

> Automation doesn't just allow you to create/fix things faster. It also allows you to break things faster.

https://news.ycombinator.com/item?id=13775966

Edit: found the original comment from NikolaeVarius

What else could they do? If they don't vibecode Claude Code it is a bad look.
omg are you me? I had this exact same problem last week