| HN Mirror


	by recipe19 335 days ago
	I work on niche platforms where the amount of example code on GitHub is minimal, and this definitely aligns with my observations. The error rate is way too high to make "vibe coding" possible. I think it's a good reality check for the claims of impending AGI. The models still depend heavily on being able to transform other people's work.

9 comments

winrid 335 days ago

Even with typescript Claude will happily break basic business logic to make tests pass.

motorest 335 days ago

> Even with typescript Claude will happily break basic business logic to make tests pass.

It's my understanding that LLMs change the code to meet a goal, and if you prompt them with vague instructions such as "make tests pass" or "fix tests", LLMs in general apply the minimum necessary and sufficient changes to any code that allows their goal to be met. If you don't explicitly instruct them, they can't and won't tell apart project code from test code. So they will change your project code to make tests work.

This is not a bug. Changing project code to make tests pass is a fundamental approach to refactoring projects, and the whole basis of TDD. If that's not what you want, you need to prompt them accordingly.
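
One way to enforce the kind of scope limit discussed here, independent of the prompt, is to check an agent's changes mechanically. This is a minimal sketch, assuming the list of changed files is already available (for example from `git diff --name-only`); the function name `out_of_scope` and the sample paths are hypothetical.

```python
from pathlib import PurePosixPath

def out_of_scope(changed_paths, allowed_dirs):
    """Return the changed paths that fall outside every allowed directory."""
    allowed = [PurePosixPath(d) for d in allowed_dirs]
    offenders = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        # A path is in scope if it sits at or below one of the allowed directories.
        if not any(path == d or d in path.parents for d in allowed):
            offenders.append(raw)
    return offenders

# Hypothetical file list, e.g. the output of `git diff --name-only`.
changed = ["tests/test_billing.py", "src/billing/invoice.py"]
print(out_of_scope(changed, allowed_dirs=["tests"]))
```

If the task was "fix the tests", a non-empty result is a signal to review the change rather than accept it blindly.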

DecoySalamander 335 days ago

> This is not a bug

It's not a bug if we're talking about a mischievous jinn granting wishes instead of a productivity tool.

desdenova 335 days ago

A jinn granting wishes is the best analogy for LLMs I've seen so far.

Sharlin 335 days ago

Except that LLMs aren't mischievous. They're just stupid.

didgeoridoo 335 days ago

Can’t it be both?

Terr_ 335 days ago

> It's my understanding that LLMs change the code to meet a goal

I assume in this case you mean a broader conventional application, of which an LLM algorithm is a smaller-but-notable piece?

LLMs themselves have no goals beyond predicting new words for a document that "fit" the older words. It may turn 2+2 into 2+2=4, but it's not actually doing math with the goal of making both sides equal.
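
A toy illustration of that distinction, and explicitly not how a real LLM is built: a lookup table that continues text with whatever most often followed the same context in its training data. The corpus and function names are made up for the example; real models generalize far beyond a table like this.

```python
from collections import Counter, defaultdict

def train(corpus, context_len):
    """Count which character follows each context of `context_len` characters."""
    table = defaultdict(Counter)
    for text in corpus:
        for i in range(context_len, len(text)):
            table[text[i - context_len:i]][text[i]] += 1
    return table

def predict_next(table, prompt, context_len):
    """Return the most frequent continuation seen after the prompt's tail, or None."""
    counts = table.get(prompt[-context_len:])
    return counts.most_common(1)[0][0] if counts else None

corpus = ["2+2=4", "2+2=4", "1+1=2"]   # made-up training text
table = train(corpus, context_len=4)
print(predict_next(table, "2+2=", 4))  # context was seen in the corpus
print(predict_next(table, "3+3=", 4))  # never seen: there is no arithmetic to fall back on
```

The first call produces the right digit only because that continuation appeared in the corpus; nothing in the code adds numbers.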

motorest 335 days ago

> I assume in this case you mean a broader conventional application, of which an LLM algorithm is a smaller-but-notable piece?

Not necessarily. If you prompt an LLM to limit changes to some projects or components, it complies with the request.

bitwize 334 days ago

> It's my understanding that LLMs change the code to meet a goal, and if you prompt them with vague instructions such as "make tests pass" or "fix tests", LLMs in general apply the minimum necessary and sufficient changes to any code that allows their goal to be met.

"I'm Mr. Meeseeks! Look at meeee!"

brailsafe 335 days ago

Do you mean not just LLMs, but agents? Is this not avoided by narrowing your scope and just using the chat interface that also may not produce what you're hoping for, but at least can't muck about in your existing code?

winrid 334 days ago

I told it to add a feature and to update the tests. It added the feature, and then removed it because it made the tests fail lol. I know I can make it work, I did, that's not the point.

chuckadams 335 days ago

Fixing bugs is also changing project code to make tests pass. The assistant is pretty good at knowing which side to change when it’s working from documentation that describes the correct behavior.

desdenova 335 days ago

That's the main problem with vibe coding.

The whole point is having the LLM figure out what you want from vague hand-wavy descriptions instead of precise specification.

You don't need an LLM to parse a precise specification, you have a compiler for that.

chuckadams 334 days ago

It's entirely possible to have specifications somewhere between "vague hand-wavy descriptions" and source code. But it's really not my job to defend AI against all the people who want it to be completely useless, seem to need it to be so, really. I just use it, it works a lot of the time, doesn't work other times, and that's that. Results carry more weight than opinions.

motorest 335 days ago

> That's the main problem with vibe coding.

It's not a problem. It's in fact the core trait of vibe coding. The primary work a developer does in vibe coding tasks is providing the necessary and sufficient context. Hence the inception of the term "context engineering". A vibe coder basically lays out requirements and constraints that drive LLMs to write code. That's the bulk of their task: they shift away from writing the low-level "how" to instead write down the high-level "what".

> The whole point is having the LLM figure out what you want from vague hand-wavy descriptions instead of precise specification.

No. The prompts are as elaborate as you want them to be. I, for example, use prompt files with the project's ubiquitous language and requirements, not to mention test suites used for acceptance tests. You can half-ass your code as much as you can half-ass your prompts.

immibis 334 days ago

Sounds like a compiler with extra steps.

bapak 335 days ago

Speaking of TypeScript, every time I feed a hard type problem to LLMs they just can't do it. Sometimes I find out it's a TS limitation or just not implemented yet, but that won't stop us from wasting 40 minutes together.

neom 335 days ago

We are building a tool specifically for typescript developers, just launched a couple of months ago and I'd really appreciate if you gave it a try and provided me with feedback, people seem to really like using it. http://charlielabs.ai - thank yooou!!! :)

rybosome 335 days ago

I’m currently doing research on this exact problem. Would you care to share an example of an advanced typing issue that you’ve seen LLMs struggle with?

rs186 335 days ago

When I vibe coded with GitHub Copilot in TypeScript, it kept using "any" even though those variables had clear interfaces already defined somewhere in the code. This drove me crazy, as I had to go in and manually fix all those things. The only thing that helps a bit is me screaming "DO NOT EVER USE 'any' TYPE". I can't understand why it would do this.
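
Rather than relying on the prompt alone, generated TypeScript can be scanned for explicit `any` before it is accepted. The sketch below is a rough regex heuristic, assuming the source is available as a string; `find_explicit_any` and the sample snippet are hypothetical, and the pattern will also flag matches inside comments and strings. A type-aware linter rule such as `@typescript-eslint/no-explicit-any` is likely the more robust choice where it is available.

```python
import re

# Matches explicit `any` in annotation or assertion position: `: any`, `as any`, `<any>`.
# A rough heuristic: it will also flag matches inside comments and strings.
ANY_PATTERN = re.compile(r"(:\s*any\b|\bas\s+any\b|<any>)")

def find_explicit_any(source):
    """Return (line_number, line) pairs for lines that appear to use `any`."""
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if ANY_PATTERN.search(line)
    ]

# Made-up TypeScript snippet used only to exercise the scanner.
sample = """\
interface User { id: number; name: string }
function load(raw: any): User {
  return raw as User;
}
const company = data.anything;
"""
for number, line in find_explicit_any(sample):
    print(number, line)
```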

CalRobert 335 days ago

That seems like the tests don’t work?

winrid 334 days ago

It made the tests fail with the new feature and then removed the feature it just added to make them pass.

paffdragon 335 days ago

Or at least they don't cover business logic if they pass while breaking it.

pygy_ 335 days ago

I've had a similar problem with WebGPU and WGSL. LLMs create buffers with the wrong flags (and other API usage errors), don't clean up resources, mix up GLSL and WGSL, write semi-less WGSL (in template strings) if you ask them to write semi-less [0] JS...

It's a big mess.

0. https://github.com/isaacs/semicolons/blob/main/semicolons.js

poniko 335 days ago

Yes, and if you work with a platform that has been around for a long time, like .NET, you will most definitely get really outdated, deprecated code mixed with the latest features.

remich 334 days ago

I recommend the context7 MCP tool for this exact purpose. I've been trying to really push agents lately at work to see where they fall down and whether better context can fix it.

As a test recently I instructed an agent using Claude to create a new MCP server in Elixir based on some code I provided that was written in Python. I know that, relatively speaking, Python is over-represented in training data and Elixir is under-represented. So, when I asked the agent to begin by creating its plan, I told it to reference current Elixir/Phoenix/etc documentation using context7 and to search the web using Kagi Search MCP for best practices on implementing MCP servers in Elixir.

It was very interesting to watch how the initially generated plan evolved after using these tools and how after using the tools the model identified an SDK I wasn't even aware of that perfectly fit the purpose (Hermes-mcp).

ragequittah 335 days ago

This is easily solved by feeding the LLM the correct documentation. I was having problems with tailwind because of this right up until I had ChatGPT deep research come up with a spec sheet on how to use the latest version of it. Fed it into the various AIs I've been using (worked for ChatGPT, Claude, and Cursor) and no problems since.

gompertz 335 days ago

Yep I program in some niche languages like Pike, Snobol4, Unicon. Vibe coding is out of the question for these languages. Forced to use my brain!

johnisgood 335 days ago

You could always feed it some documentation and example programs. I did it with a niche language and it worked out really well, with Claude. Around 8 months ago.

empressplay 335 days ago

I don't know if you're working with modern models. Grok 4 doesn't really know much about assembly language on the Apple II but I gave it all of the architectural information it needed in the first prompt of a conversation and it built compilable and executable code. Most of the issues I encountered were due to me asking for too much in a prompt. But it built a complete, albeit simple, assembly language game in a few hours of back and forth with it. Obviously I know enough about the Apple II to steer it when it goes awry, but it's definitely able to write 'original' code in a language / platform it doesn't inherently comprehend.

timschmidt 335 days ago

This matches my experience as well. Poor performance usually means I haven't provided enough context or have asked for too much in a single prompt. Modifying the prompt accordingly and iterating usually results in satisfactory output within the next few tries.

vineyardmike 335 days ago

Completely agree. I’m a professional engineer, but I like to get some ~vibe~ help on personal projects after work when I’m tired and just want my personal project to go faster. I’ve had a ton of success with Go, JavaScript, Python, etc. I had mixed success with writing idiomatic Elixir roughly a year ago, but I’ve largely assumed that this would be resolved today, since every model maker has started aggressively filling training data with code, since we found the PMF of LLM code-assistance.

Last night I tried to build a super basic “barely above hello world” project in Zig (a language where IDK the syntax), and it took me trying a few different LLMs to find one that could actually write anything that would compile (Gemini w/ search enabled). I really wasn’t expecting it considering how good my experience has been on mainstream languages.

Also, I think OP did rather well considering BASIC is hardly used anymore.

andsoitis 335 days ago

> The models

The models don’t have a model of the world. Hence they cannot reason about the world.

pygy_ 335 days ago

I tried vibe coding WebGPU/WGSL, which is thoroughly documented, but has little actual code around, and LLMs are pretty bad at it right now.

They don't need a formal model, they need examples from which they can pilfer.

bawana 334 days ago

The theory is that language is an abstraction built on top of the world and therefore encompasses all human experience of the world. The problem will arise, however, when the world (aka nature) acts in an unexpected way outside human experience.

hammyhavoc 335 days ago

"reason" is doing some heavy-lifting in the context of LLMs.

jjmarr 335 days ago

I've noticed the error rate doesn't matter if you have good tooling feeding into the context. The AI hallucinates, sees the bug, and fixes it for you.
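
The loop described here (generate, run a tool, feed the errors back) can be sketched as below. This is an illustration under stated assumptions, not any particular agent's implementation: `generate` stands in for a model call and `check` for a compiler, linter, or test run, and both are hypothetical callables supplied by the caller. The fake model exists only so the example runs on its own.

```python
def repair_loop(generate, check, task, max_rounds=3):
    """Ask for code, run a checker, and feed any errors back until it passes.

    `generate(prompt)` stands in for a model call and `check(code)` for a
    tool that returns a list of error strings. Both are hypothetical.
    """
    prompt = task
    for round_number in range(1, max_rounds + 1):
        code = generate(prompt)
        errors = check(code)
        if not errors:
            return code, round_number
        # Put the tool output into the next prompt so the model can see the bug.
        prompt = f"{task}\n\nPrevious attempt:\n{code}\n\nErrors:\n" + "\n".join(errors)
    return None, max_rounds

# Stand-in for illustration: this fake model only gets it right after seeing an error.
def fake_generate(prompt):
    return "print('hi')" if "Errors:" in prompt else "print('hi'"

def python_syntax_check(code):
    """Use Python's own compiler as the checking tool."""
    try:
        compile(code, "<generated>", "exec")
        return []
    except SyntaxError as exc:
        return [f"SyntaxError: {exc.msg}"]

print(repair_loop(fake_generate, python_syntax_check, "Print hi"))
```

The round cap matters: without it, a model that never converges would loop forever.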

cmrdporcupine 334 days ago

I find for these kinds of systems, if I pre-seed Claude Code with a read of the language manual (even the BNF etc) and a TLDR of what it is, results are far better. Just part of the initial prompt: read this summary page, read this grammar, and look at this example code.

I have had it writing LambdaMOO code, with my own custom extensions (https://github.com/rdaum/moor) and it's ... not bad considering.
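
The pre-seeding step (summary, then grammar, then example code, then the task) could be assembled along these lines. A minimal sketch only: `build_seed_prompt`, the section titles, and all the placeholder strings are assumptions for illustration and are not taken from the comment or from any tool's API.

```python
def build_seed_prompt(summary, grammar, examples, task):
    """Assemble reference material ahead of the task: summary, grammar, examples."""
    sections = [
        ("Language summary", summary),
        ("Grammar", grammar),
    ]
    sections += [(f"Example {i}", code) for i, code in enumerate(examples, start=1)]
    sections.append(("Task", task))
    return "\n\n".join(f"## {title}\n{body.strip()}" for title, body in sections)

# All strings below are placeholders, not real documentation for any language.
prompt = build_seed_prompt(
    summary="A small scripting language with dynamic typing.",
    grammar="statement ::= expression ';'",
    examples=["return 1 + 1;"],
    task="Write a function that reverses a list.",
)
print(prompt)
```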
