| HN Mirror


by wg0 44 days ago

Snake oil. Good to read for sure. Seems all plausible too. But snake oil nevertheless.

Here's why: The slot machine can drop any hard requirement that you specify in your AGENTS.md, memory.md or your dozens of skill markdowns. Pretty much guaranteed.

These harness approaches pretend that LLMs are strict and perfect rule followers and that the only problem is not being able to specify enough rules clearly enough. That's a fundamental lapse in understanding how LLMs operate.

That leaves only one option not reliable but more reliable nevertheless: Human review and oversight. Possibly two of them one after the other.

Everything else is snake oil but at that point, you also realize that promised productivity gains are also snake oil because reading code and building a mental model is way harder than having a mental model and writing it into code.

14 comments

keeda 44 days ago

Snake oil may be a bit strong, because snake oil never works (except maybe as placebo?) whereas anything with an LLM, even though stochastic, has a pretty high chance of working.

> ... you also realize that promised productivity gains are also snake oil because reading code and building a mental model is way harder than having a mental model and writing it into code.

Not really, though it depends on the code; reading code is a skill that gets easier with practice, like any other. This is common any time you're reading much more code than you're writing (e.g. any time you have to work with a large, sprawling codebase that existed long before you touched it.)

What makes it even easier, though, is if you're armed with an existing mental model of the code, either gleaned through documentation, or past experience with the code, or poking your colleagues.

And you can do this with agents too! I usually already have a good mental model of the code before I prompt the AI. It requires decomposing the tasks a bit carefully, but because I have a good idea of what the code should look like, reviewing the generated code is a breeze. It's like reading a book I've read before. Or, much more rarely, there's something wrong and it jumps out at me right away, so I catch most issues early. Either way the speed up is significant.

jazzypants 43 days ago

I think the placebo effect might be a decent comparison. It works most of the time, and you don't worry about it as long as you fully believe in its efficacy. However, once the illusion is shattered, the positive effects are diminished, and you can never fully trust the solution again.

intended 44 days ago

> has a pretty high chance of working.

for MVPs, mock ups, prototypes or in the hands of an expert coder. You can't let them go unsupervised. The promise of automated intelligence falls far short of the reality.

crimsoneer 43 days ago

Not only "has a high chance of working", but you can pay more to make it more reliable. It really is striking to try running a harness like openClaw on a smaller or quantised model; it really makes you realise how much we take for granted from SOTA models that was totally impossible just a year ago, in terms of complex, generally reliable tool use.

j45 44 days ago

A pretty high chance isn’t the intent or impression the end user often has.

kergonath 43 days ago

Indeed, and it is a complicated problem to solve. A GUI or CLI can hide footguns or make them less likely to be misused. But an AI agent is perfectly happy to use a wrecking ball to drive a nail without any second thought or confirmation.

j45 43 days ago

It’s a human articulation problem.

When it receives a generic vague input it is free to interpret according to how its corpus fires like any human interaction.

How to articulate better is like writing a sentence that will stand the test of model updates.

kergonath 43 days ago

Even then. I don’t have an example off the top of my head but even perfectly clear sentences can lead the agent to strange places. Even between humans, miscommunication is easy, but then anyone sensible would ask for confirmation if their interpretation is weird. But the LLM very rarely questions the user.

I don’t think it’s fair to blame the user here. The tool must be operated by normal users.

j45 43 days ago

I'm trying to think of other types of tooling that normal users can all use equally well, or in the best ways possible.

vidarh 44 days ago

Humans also regularly drop any hard requirements you specify, and similarly require review. Nevertheless we manage to increase the reliability of human output through processes and reviews, and most of the methods we use for harnesses are taken from experience with how to reduce reliability issues in humans, who are notoriously difficult to get to deliver reliably.

kaashif 44 days ago

The primary way to increase reliability is to automate. Instead of humans producing some output manually, humans produce machines which produce that output.

I've seen a disturbing trend where a process that could've been a script or a requirement that could've been enforced deterministically is in fact "automated" through a set of instructions for an LLM.
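As a minimal sketch of what "enforced deterministically" could look like, assume a made-up rule that every Python file must start with a license header. The rule, the header text, and the function name are all hypothetical, chosen only for illustration; the point is that a plain script gives the same answer every run, with no prompt involved.

```python
from pathlib import Path

# Hypothetical requirement: every .py file must start with this exact line.
REQUIRED_HEADER = "# SPDX-License-Identifier: MIT"


def files_missing_header(root: Path, header: str = REQUIRED_HEADER) -> list[Path]:
    """Return every .py file under root whose first line is not the header."""
    missing = []
    for path in sorted(root.rglob("*.py")):
        lines = path.read_text(encoding="utf-8").splitlines()
        first_line = lines[0] if lines else ""  # an empty file has no header
        if first_line.strip() != header:
            missing.append(path)
    return missing
```

Wired into CI or a pre-commit hook, a non-empty result would fail the build, which is a guarantee an instruction in a markdown file cannot give.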

vidarh 44 days ago

Sure, when that is possible. However, there are lots of processes we don't know how to automate in a deterministic way. Hence the vast amount of investment in building organisations of people with mechanisms to make people's output more reliable through structure, reviews, and so on.

Large parts of human civilization rest on our ability to make something unreliable less unreliable through organisational structure and processes.

intended 44 days ago

We resolve that through liability, penalties, trust, responsibility, review and oversight.

At the end of the day, if I am spending X$s for automation, I want to be able to sleep at night knowing my factory will not build a WMD or delete itself.

If it's simply a tool that is a multiplier for experts, then do I really need it? How much does it actually make my processes more efficient, faster, or more capable of earning revenue?

There is a LOT that is forgiven when tech is new - but at some point the shiny newness falls off and it is compared to alternatives.

vidarh 43 days ago

Liability, penalties, trust, and responsibility are means we use to try to influence the application of the processes that do. They do not directly affect reliability. They can be applied just as much to a team using AI as one that does not.

Review and oversight do address reliability directly, which is why we make use of them in processes to improve the reliability of mechanical processes as well, and why they are core elements of AI harnesses.

> If it's simply a tool that is a multiplier for experts, then do I really need it? How much does it actually make my processes more efficient, faster, or more capable of earning revenue?

You can ask the same thing about all the supporting staff around the experts in your team.

> There is a LOT that is forgiven when tech is new - but at some point the shiny newness falls off and it is compared to alternatives.

Only teams without mature processes are not doing that for AI today.

Most of the deployments of AI I work on are the outcome of comparing it to alternatives, and often are part of initiatives to increase reliability of human teams just as much as increasing raw productivity, because they are often one and the same.

intended 43 days ago

> Liability, penalties, trust, and responsibility are means we use to try to influence the application of the processes that do. They do not directly affect reliability. They can be applied just as much to a team using AI as one that does not.

Yes and no. see next point.

> You can ask the same thing about all the supporting staff around the experts in your team.

I have a good idea of the shape of errors for a human based process, costing and the type of QA/QC team that has to be formed for it.

We have decades, if not centuries of experience working with humans, which LLMs are promising to be the equivalents/superiors of.

I think you and I would both agree with the statement "use the right tool for the job".

However, the current hype cycle has created expectations of reliability from LLMs that drive 'Automated Intelligence' styled workflows.

On the other hand:

> part of initiatives to increase reliability of human teams

is a significantly more defensible use of LLMs.

For me, most deployments die on the altar of error rates. The only people who are using them to any effect are people who have an answer to "what happens when it blows up" and "what is the cost if something goes wrong".

(there is no singular thread behind my comment. I think we probably have more in agreement than not, and it's more a question of finding the precise words to declare the shapes we perceive.)

j45 44 days ago

Underrated comment.

So many applications of LLMs start out with a deterministic mindset while using a non-deterministic LLM, and then people wonder why it’s not working.

jnpnj 43 days ago

it's strange to see software engineers using skills aka human description of small scripts instead of scripting things directly. often there were cli / tools / libraries to do what a skill does for many years. maybe it's a culture issue, people who enjoy automation / devops / predictability will naturally help themselves, but other people just want to "delegate" and be done without trying.

vidarh 43 days ago

When people do that they are using skills wrong. The best way to use a skill is as a means to give targeted instructions on how to make use of cli / tools / libraries, with the skill just covering the "squishy bits" that aren't easily encoded into something deterministic.

cortesoft 44 days ago

Everything you say is all possible, and in theory I agree with you.

However, I have been using spec-kit (which is basically this style of AI usage) for the last few months and it has been AMAZING in practice. I am building really great things and have not run into any of the issues you are talking about as hypotheticals. Could they eventually happen? Sure, maybe. I am still cautious.

But at some point once you have personally used it in practice for long enough, I can't just dismiss it as snake oil. I have been a computer programmer for over 30 years, and I feel like I have a good read on what works and what doesn't in practice.

wg0 44 days ago

We can build all the scaffolding around them, but I assure you that the fact that LLMs aren't perfect rule following machines is the fundamental problem here, and that will remain.

Give it a few more months and I'm sure you'll see some of what I see if not all.

I'm saying all of the above having tried and tested all sorts of systems with AI, which is what led me to say what I said.

cortesoft 44 days ago

I have been doing this for 6 months or so now, and I am not sure that even if you have a lot more experience than me that it would make your assessment more accurate, since that just means you have more experience with prior generations of the models. What I have experienced is that the AI has been getting better and better, and is making fewer and fewer mistakes.

Now, part of that is my advancements as well, as I learn how to specify my instructions to the AI and how to see in advance where the AI might have issues, but the advancements are also happening in the models themselves. They are just getting better, and rapidly.

The combination of getting better at steering the AI along with the AI itself getting better is leading me to the opposite conclusion from yours. I have production systems that I wrote using spec-kit, that have been running in production for months, and have been doing spectacularly. I have been able to consistently add the new features that I need to, without losing any cohesion or adherence to the principles I have defined. Now, are there mistakes? Of course, but nothing that can't be caught and fixed, and not at a higher rate than traditional programming.

Quarrel 44 days ago

> LLMs aren't perfect rule following machines is the fundamental problem here

I kind of get what you're saying, but let us not pretend that SW engineers are perfect rule followers either.

Having a framework to work within, whether you are an LLM or a human, can be helpful.

suttontom 43 days ago

If someone regularly ignored critical instructions even though they were written down and had been told to follow them, that person would be fired.

scottg489 42 days ago

People are excused all the time for things because they are elevated in other areas. It's about their value as a whole and that's where we are with LLMs. They aren't perfect but they do plenty we can't which means they are worth using.

saidnooneever 44 days ago

i think how your experience with LLMs is depends on your goals and also your preference / expectation. i dont mind if they hallucinate. even if i have mental model of code i wont write it myself perfectly either.

the only downside i see is getting out of practice, which is why for my passion projects i dont use it. work is just work and pressing 1 or 2 and having 'good enough' can be a fine way to get through the day. (lucky me i dont write production code ;D... goals...)

albedoa 43 days ago

> Give it a few more months

By that time, they will have realized immense value before seeing some of what you see. Sounds like an endorsement of spec-kit.

kergonath 43 days ago

> The slot machine can drop any hard requirement that you specify in your AGENTS.md, memory.md or your dozens of skill markdowns. Pretty much guaranteed.

Indeed. That said, I’ve had some success with agent skills, but I use them to make the LLM aware of things it can do using specific external tools. I think it is a really bad idea to use this mechanism to enforce safety rules. We need good sandboxing for this, and promises from a model prone to going off the rails are not a good substitute.

But I have taught my coding agent to use some ad hoc tools to gather statistics from a directory containing experimental data and things like that. Nobody is going to fine tune an LLM specifically for my field (condensed matter physics) but using skills I can still make it do useful work. Like monitoring simulations where some runs can fail for various reasons and each time we must choose whether to run another iteration or re-start from a previous point, based on eyeballing the results ("the energy is very strange, we should restart properly and flag for review if it is still weird", this sort of thing). I don’t give too many rules to the agent, I just give it ways of solving specific problems that may arise.

selimthegrim 43 days ago

Do you have any information on skills you've found useful here?

kergonath 42 days ago

Not really, unfortunately. I took some inspiration from existing skills, mostly in the official GitHub repo https://github.com/agentskills/ . But mostly I had to come up with them myself. I tried to use Claude to help but it was not that useful.

kajman 44 days ago

I hope the only reason people are pretending these markdown suggestions are a "workflow" is fear that a more structured approach will be obsolete by the time it's polished. I can't imagine the pace of innovation with the underlying models will stay like this forever.

I hope to see harnesses that will demand instead of ask. Kill an agent that was asked to be in plan mode but did not play the prescribed planning game. Even if it's not perfect, it'd have to be better than the current regime when combined with a human in the loop.
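A rough sketch of "demand instead of ask", assuming the harness can see each tool call before it runs. The tool names, the `ToolCall` type, and `enforce_plan_mode` are all hypothetical and not taken from any real harness; the idea is only that the check lives in code outside the model, so a plan-mode agent that reaches for a write tool is stopped rather than reminded.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical set of tools a planning agent may use; anything else is a violation.
READ_ONLY_TOOLS = frozenset({"read_file", "list_dir", "search"})


@dataclass(frozen=True)
class ToolCall:
    name: str
    argument: str


class PlanModeViolation(RuntimeError):
    """Raised when an agent in plan mode requests a non-read-only tool."""


def enforce_plan_mode(calls: Iterable[ToolCall]) -> list[ToolCall]:
    """Pass read-only calls through; abort the run at the first other call."""
    allowed = []
    for call in calls:
        if call.name not in READ_ONLY_TOOLS:
            # The harness kills the run here instead of asking the model to behave.
            raise PlanModeViolation(f"plan mode forbids tool {call.name!r}")
        allowed.append(call)
    return allowed
```

The model can still produce a bad plan, so this does not replace the human in the loop; it only makes one rule impossible to drop.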

raincole 44 days ago

Don't let the perfect be the enemy of the good. Of course we know the AGENTS.md and skills aren't 100% effective. But no, it doesn't mean that they're 0% effective.

peterbell_nyc 43 days ago

Helps if you hand it both to the original agent as strong guidance and then to an adversarial agent as a quality reviewer. The adversarial agent is more likely to loop the work back if it fails the validation criteria.

I do find that just asking the same agent to do and check its own work is not particularly reliable.

moomin 43 days ago

This is like saying a +5 sword is useless because you still miss on a one. We’ve got to think about expected outcomes. Because if she’s merging five solid PRs to your three, there's little point loudly complaining about the one she saw was rubbish and threw away.

chaostheory 44 days ago

I can see why this would seem to be “snake oil” logically. However, this approach does work in reality. Your comment just shows that you seem inexperienced with using generative AI.

Chris2048 43 days ago

> That leaves only one option not reliable but more reliable nevertheless: Human review and oversight.

Couldn't non-manual oversight also help, e.g. sandboxes?

vdelpuerto 42 days ago

Agree. Human review and iteration. Hooks to gate certain decisions you don't want to happen again, like functional scars.

SubiculumCode 43 days ago

All these points apply to human devs as well. The test is not infallibility but magnitude

j16sdiz 43 days ago

A slot machine isn't snake oil.

A slot machine gives you rewards when the stars align, snake oil never does :)

blitzar 44 days ago

All this said, I quite like the mental model of documenting a simple process, and I suspect our future ai overlords will find it useful that I have a series of md files that outline my preferences and processes for certain tasks.

I am not however going to share any of this with work colleagues and make myself redundant.