by recipe19 335 days ago
I work on niche platforms where the amount of example code on GitHub is minimal, and this definitely aligns with my observations. The error rate is way too high to make "vibe coding" possible.

I think it's a good reality check for the claims of impending AGI. The models still depend heavily on being able to transform other people's work.

9 comments

Even with typescript Claude will happily break basic business logic to make tests pass.
> Even with typescript Claude will happily break basic business logic to make tests pass.

It's my understanding that LLMs change the code to meet a goal, and if you prompt them with vague instructions such as "make tests pass" or "fix tests", LLMs in general apply the minimum necessary and sufficient changes to any code that allows their goal to be met. If you don't explicitly instruct them, they can't and won't tell apart project code from test code. So they will change your project code to make tests work.

This is not a bug. Changing project code to make tests pass is a fundamental approach to refactoring projects, and the whole basis of TDD. If that's not what you want, you need to prompt them accordingly.

> This is not a bug

It's not a bug if we're talking about a mischievous jinn granting wishes instead of a productivity tool.

A jinn granting wishes is the best analogy for LLMs I've seen so far.
Except that LLMs aren't mischievous. They're just stupid.
Can’t it be both?
> It's my understanding that LLMs change the code to meet a goal

I assume in this case you mean a broader conventional application, of which an LLM algorithm is a smaller-but-notable piece?

LLMs themselves have no goals beyond predicting new words for a document that "fit" the older words. It may turn 2+2 into 2+2=4, but it's not actually doing math with the goal of making both sides equal.

> I assume in this case you mean a broader conventional application, of which an LLM algorithm is a smaller-but-notable piece?

Not necessarily. If you prompt an LLM to limit changes to some projects or components, it complies with the request.

> It's my understanding that LLMs change the code to meet a goal, and if you prompt them with vague instructions such as "make tests pass" or "fix tests", LLMs in general apply the minimum necessary and sufficient changes to any code that allows their goal to be met.

"I'm Mr. Meeseeks! Look at meeee!"

Do you mean not just LLMs, but agents? Is this not avoided by narrowing your scope and just using the chat interface, which also may not produce what you're hoping for, but at least can't muck about in your existing code?
I told it to add a feature and to update the tests. It added the feature, and then removed it because it made the tests fail lol. I know I can make it work, I did, that's not the point.
Fixing bugs is also changing project code to make tests pass. The assistant is pretty good at knowing which side to change when it’s working from documentation that describes the correct behavior.
That's the main problem with vibe coding.

The whole point is having the LLM figure out what you want from vague hand-wavy descriptions instead of precise specification.

You don't need an LLM to parse a precise specification, you have a compiler for that.

It's entirely possible to have specifications somewhere between "vague hand-wavy descriptions" and source code. But it's really not my job to defend AI against all the people who want it to be completely useless, seem to need it to be so, really. I just use it, it works a lot of the time, doesn't work other times, and that's that. Results carry more weight than opinions.
> That's the main problem with vibe coding.

It's not a problem. It's in fact the core trait of vibe coding. The primary work a developer does in vibe coding tasks is providing the necessary and sufficient context. Hence the inception of the term "context engineering". A vibe coder basically lays out requirements and constraints that drive LLMs to write code. That's the bulk of their task: they shift away from writing the low-level "how" to instead write down the high-level "what".

> The whole point is having the LLM figure out what you want from vague hand-wavy descriptions instead of precise specification.

No. The prompts are as elaborate as you want them to be. I, for example, use prompt files with the project's ubiquitous language and requirements, not to mention test suites used for acceptance tests. You can half-ass your code as much as you can half-ass your prompts.

Sounds like a compiler with extra steps.
Speaking of TypeScript, every time I feed a hard type problem to LLMs they just can't do it. Sometimes I find out it's a TS limitation or just not implemented yet, but that won't stop us from wasting 40 minutes together.
We are building a tool specifically for typescript developers, just launched a couple of months ago and I'd really appreciate if you gave it a try and provided me with feedback, people seem to really like using it. http://charlielabs.ai - thank yooou!!! :)
I’m currently doing research on this exact problem. Would you care to share an example of an advanced typing issue that you’ve seen LLMs struggle with?
When I vibe coded with GitHub Copilot in TypeScript, it kept using "any" even though those variables had clear interfaces already defined elsewhere in the code. This drove me crazy, as I had to go in and manually fix all of them. The only thing that helped a bit was me screaming "DO NOT EVER USE 'any' TYPE". I can't understand why it would do this.
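For anyone who hasn't hit this, here is a minimal sketch of the pattern being described. All names (`Invoice`, `totalLoose`, `totalTyped`) are hypothetical and only for illustration; they are not from the poster's code. The point is that `any` throws away an interface that already exists, so a property typo compiles and only shows up at runtime.

```typescript
// Hypothetical interface standing in for one "already defined" in the project.
interface Invoice {
  id: string;
  amountCents: number;
}

// The complained-about pattern: `any` discards the interface, so the
// misspelled property below compiles and silently yields NaN at runtime.
function totalLoose(invoices: any[]): number {
  return invoices.reduce((sum, inv) => sum + inv.amountCent, 0); // typo goes unnoticed
}

// Using the existing interface: the same typo would be a compile error here.
function totalTyped(invoices: Invoice[]): number {
  return invoices.reduce((sum, inv) => sum + inv.amountCents, 0);
}
```

Besides shouting in the prompt, a lint rule such as `@typescript-eslint/no-explicit-any` can flag explicit `any` mechanically, though whether that fits a given setup is a judgment call.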
That seems like the tests don’t work?
It made the tests fail with the new feature and then removed the feature it just added to make them pass.
Or at least they don't cover business logic if they pass while breaking it.
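A rough sketch of what "tests that pass while breaking business logic" can look like. Everything here is hypothetical (the discount rule, the function names); it is only meant to show the difference between a check that looks at the shape of the result and one that pins the rule itself.

```typescript
// Hypothetical business rule: orders of 10000 cents or more get 10% off.
function priceWithRule(subtotalCents: number): number {
  return subtotalCents >= 10000
    ? subtotalCents - Math.floor(subtotalCents / 10)
    : subtotalCents;
}

// A variant with the rule silently removed, as in the stories above.
function priceWithoutRule(subtotalCents: number): number {
  return subtotalCents;
}

type PriceFn = (subtotalCents: number) => number;

// Weak check: only asserts the result is a positive number.
function weakCheck(price: PriceFn): boolean {
  const result = price(10000);
  return typeof result === "number" && result > 0;
}

// Rule check: pins the discounted value and the boundary just below it.
function ruleCheck(price: PriceFn): boolean {
  return price(10000) === 9000 && price(9999) === 9999;
}
```

Both implementations pass `weakCheck`, but only `priceWithRule` passes `ruleCheck`, so a suite made of weak checks gives an assistant no signal that the feature is gone.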
I've had a similar problem with WebGPU and WGSL. LLMs create buffers with the wrong flags (and make other API usage errors), don't clean up resources, mix up GLSL and WGSL, and write semi-less WGSL (in template strings) if you ask them to write semi-less [0] JS...

It's a big mess.

0. https://github.com/isaacs/semicolons/blob/main/semicolons.js

Yes, and if you work with a platform that has been around for a long time, like .NET, you will most definitely get really outdated, deprecated code mixed with the latest features.
I recommend the context7 MCP tool for this exact purpose. I've been trying to really push agents lately at work to see where they fall down and whether better context can fix it.

As a test recently I instructed an agent using Claude to create a new MCP server in Elixir based on some code I provided that was written in Python. I know that, relatively speaking, Python is over-represented in training data and Elixir is under-represented. So, when I asked the agent to begin by creating its plan, I told it to reference current Elixir/Phoenix/etc documentation using context7 and to search the web using Kagi Search MCP for best practices on implementing MCP servers in Elixir.

It was very interesting to watch how the initially generated plan evolved after using these tools and how after using the tools the model identified an SDK I wasn't even aware of that perfectly fit the purpose (Hermes-mcp).

This is easily solved by feeding the LLM the correct documentation. I was having problems with tailwind because of this right up until I had ChatGPT deep research come up with a spec sheet on how to use the latest version of it. Fed it into the various AIs I've been using (worked for ChatGPT, Claude, and Cursor) and no problems since.
Yep I program in some niche languages like Pike, Snobol4, Unicon. Vibe coding is out of the question for these languages. Forced to use my brain!
You could always feed it some documentation and example programs. I did it with a niche language and it worked out really well, with Claude. Around 8 months ago.
I don't know if you're working with modern models. Grok 4 doesn't really know much about assembly language on the Apple II but I gave it all of the architectural information it needed in the first prompt of a conversation and it built compilable and executable code. Most of the issues I encountered were due to me asking for too much in a prompt. But it built a complete, albeit simple, assembly language game in a few hours of back and forth with it. Obviously I know enough about the Apple II to steer it when it goes awry, but it's definitely able to write 'original' code in a language / platform it doesn't inherently comprehend.
This matches my experience as well. Poor performance usually means I haven't provided enough context or have asked for too much in a single prompt. Modifying the prompt accordingly and iterating usually results in satisfactory output within the next few tries.
Completely agree. I’m a professional engineer, but I like to get some ~vibe~ help on personal projects after work when I’m tired and just want my project to go faster. I’ve had a ton of success with Go, JavaScript, Python, etc. I had mixed success writing idiomatic Elixir roughly a year ago, but I’d largely assumed that this would be resolved by now, since every model maker has started aggressively filling training data with code now that we’ve found the PMF of LLM code assistance.

Last night I tried to build a super basic “barely above hello world” project in Zig (a language where IDK the syntax), and it took me trying a few different LLMs to find one that could actually write anything that would compile (Gemini w/ search enabled). I really wasn’t expecting it considering how good my experience has been on mainstream languages.

Also, I think OP did rather well considering BASIC is hardly used anymore.

> The models

The models don’t have a model of the world. Hence they cannot reason about the world.

I tried vibe coding WebGPU/WGSL, which is thoroughly documented, but has little actual code around, and LLMs are pretty bad at it right now.

They don't need a formal model, they need examples from which they can pilfer.

The theory is that language is an abstraction built on top of the world and therefore encompasses all human experience of the world. The problem will arise, however, when the world (aka nature) acts in an unexpected way outside human experience.
"reason" is doing some heavy-lifting in the context of LLMs.
I've noticed the error rate doesn't matter if you have good tooling feeding into the context. The AI hallucinates, sees the bug, and fixes it for you.
I find for these kinds of systems, if I pre-seed Claude Code with a read of the language manual (even the BNF etc) and a TLDR of what it is, results are far better. Just part of the initial prompt: read this summary page, read this grammar, and look at this example code.

I have had it writing LambdaMOO code, with my own custom extensions (https://github.com/rdaum/moor) and it's ... not bad considering.
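The pre-seeding step described above (summary page, grammar, example code, all in the initial prompt) could be sketched roughly like this. This is not how Claude Code or any particular tool does it; `SeedMaterial` and `buildSeedPrompt` are made-up names, and the section wording is an assumption.

```typescript
// Hypothetical shape for the material read into the first prompt.
interface SeedMaterial {
  summary: string;    // TL;DR of what the language or system is
  grammar: string;    // e.g. the BNF from the language manual
  examples: string[]; // representative example programs
}

// Assemble one initial prompt: summary first, then grammar, then examples,
// and only then the actual task.
function buildSeedPrompt(seed: SeedMaterial, task: string): string {
  const exampleBlocks = seed.examples.map(
    (code, i) => `Example ${i + 1}:\n${code}`
  );
  return [
    "Read this summary of the language:",
    seed.summary,
    "Read this grammar:",
    seed.grammar,
    "Look at this example code:",
    ...exampleBlocks,
    "Task:",
    task,
  ].join("\n\n");
}
```

The ordering is a design guess: reference material goes before the task so the request is read with the grammar and examples already in context.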