Hacker News
by mettamage 65 days ago
So…

What is a harness? People have been talking about it and I couldn't glean what it is.


AI models on their own are raw, undirected, and inherently probabilistic. A "harness" acts as a control layer wrapped around the model, designed to steer it toward deterministic outcomes. It achieves this by equipping the model with actionable tools like web search or file I/O, and by orchestrating an evaluation loop that runs until an acceptable result is produced. (various analogies work here - an astronaut and a space suit, a rocket and the launch pad/mission control, okay I'm out of analogies that aren't car engines)
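The evaluation loop described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not any particular product's implementation. `call_model` is a hypothetical stand-in for a real LLM API call. `evaluate` is a hypothetical acceptance check; in practice it might run tests, a linter, or a second model acting as a judge.

```python
from typing import Callable, Optional

def run_harness(
    task: str,
    call_model: Callable[[str], str],
    evaluate: Callable[[str], Optional[str]],
    max_iterations: int = 5,
) -> Optional[str]:
    """Ask the model, check the result, and feed failures back until accepted.

    `evaluate` returns None when the output is acceptable, otherwise a short
    description of what is wrong, which is appended to the next prompt.
    """
    prompt = task
    for _ in range(max_iterations):
        output = call_model(prompt)
        problem = evaluate(output)
        if problem is None:
            return output  # acceptable result: stop the loop
        # Steer the next attempt with concrete feedback instead of retrying blindly.
        prompt = f"{task}\n\nYour previous answer was rejected: {problem}\nTry again."
    return None  # give up after max_iterations rather than looping forever


if __name__ == "__main__":
    # A fake model that only gets it right on the second try.
    attempts = iter(["def add(a, b): return a - b", "def add(a, b): return a + b"])
    fake_model = lambda prompt: next(attempts)

    def check(code: str) -> Optional[str]:
        scope: dict = {}
        exec(code, scope)  # toy check only; never exec untrusted model output for real
        return None if scope["add"](2, 3) == 5 else "add(2, 3) did not return 5"

    print(run_harness("Write add(a, b).", fake_model, check))
```

The cap on iterations matters as much as the loop itself: without it, a model that never satisfies the check would keep the harness spinning indefinitely.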

You can see this in practice by looking at the leaked Claude Code source code. It is a harness around Anthropic's model built for writing code. It relies on heavily engineered (and sometimes brittle) steering mechanisms. These range from highly specific situational prompts to deterministic, hard-coded logic that executes based on the model's output.
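The "deterministic, hard-coded logic that executes based on the model's output" can be illustrated with a small rule table. This is a hypothetical sketch and is not taken from the Claude Code source; the rule names, regex, and messages are invented for illustration. Each rule is a plain predicate over the model's output that either blocks the output or injects a situational reminder prompt.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Rule:
    name: str
    matches: Callable[[str], bool]
    action: str   # "block" rejects the output; "remind" lets it through but adds a prompt
    message: str

# Hypothetical rules; real harnesses tend to accumulate many of these over time.
RULES: List[Rule] = [
    Rule("destructive_rm",
         lambda out: re.search(r"\brm\s+-rf\s+/(\s|$)", out) is not None,
         "block", "Refusing to run `rm -rf /`."),
    Rule("claims_done_without_tests",
         lambda out: "done" in out.lower() and "test" not in out.lower(),
         "remind", "Before saying you are done, run the test suite and report the result."),
]

def steer(output: str) -> Tuple[bool, Optional[str]]:
    """Return (allowed, extra_prompt). Rules are checked in order; first match wins."""
    for rule in RULES:
        if rule.matches(output):
            if rule.action == "block":
                return False, rule.message
            # "remind": the harness would append this message as the next user turn.
            return True, rule.message
    return True, None


if __name__ == "__main__":
    print(steer("rm -rf /"))
    print(steer("All done!"))
```

Rules like the second one show where the brittleness comes from: a keyword check on "done" and "test" is cheap and deterministic, but easy to trip by accident and easy for a rephrased answer to slip past.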

Getting a harness right is incredibly hard and feels like whack-a-mole at times.

I can totally agree on the harness part. When I first set out to create a Cursor killer nearly 3 years ago, I built LLM tools, but what I didn't know then was that I was trying to wrap the LLM's brain around the tools when it needed to be the other way around.

It took me three years, off and on, to realize I was doing it backwards. Agent was originally born after I rewrote CloneTool, a more generic disk cloning tool with an SMAppService launch daemon.

After I completed CloneTool, I was like, mmmmm, what if I connected an LLM to the daemon? It rattled off 50 things it could do, even though none of this was described anywhere in the harness, system prompt, or tools. It had simply figured out its environment on its own.

I never ran Agent under that scenario; it definitely has a harness now. And yes, getting the harness right is the number one challenge. Once you get it working well with most LLMs out of the box, you try not to change it, because that sweet spot is hard to come by. That's not to say it never gets tweaked, but the further in you go, the more you cringe at a change that might break it.

A while loop, plus some prompts basically amounting to "this is how you format a system call" and "make no mistakes". There's also a regex + executor for detecting and executing system calls.
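The loop this comment describes can be sketched as follows. It is a minimal sketch under assumptions: the `<call name="...">` syntax, the `read_file` tool, and `fake_model` are all hypothetical, and many current harnesses use structured tool-calling APIs rather than regex parsing of free text.

```python
import re
from typing import Callable, Dict, List

# Hypothetical call format the system prompt would teach the model:
#   <call name="read_file">notes.txt</call>
CALL_RE = re.compile(r'<call name="(\w+)">(.*?)</call>', re.DOTALL)

SYSTEM_PROMPT = (
    'To use a tool, write <call name="TOOL">ARGUMENT</call> and stop. '
    "When you have the final answer, reply without any <call> tag."
)

def agent_loop(
    user_msg: str,
    model: Callable[[List[str]], str],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 10,
) -> str:
    transcript = [SYSTEM_PROMPT, user_msg]
    for _ in range(max_steps):  # the "while loop", bounded so it cannot spin forever
        reply = model(transcript)
        transcript.append(reply)
        calls = CALL_RE.findall(reply)  # the regex that detects system calls
        if not calls:
            return reply  # no call detected: treat the reply as the final answer
        for name, arg in calls:  # the executor
            tool = tools.get(name)
            result = tool(arg) if tool else f"error: unknown tool {name!r}"
            transcript.append(f'<result name="{name}">{result}</result>')
    return "error: step limit reached"


if __name__ == "__main__":
    files = {"notes.txt": "buy milk"}

    def fake_model(transcript: List[str]) -> str:
        # First turn: request the file. Next turn: answer using the tool result.
        if not transcript[-1].startswith("<result"):
            return '<call name="read_file">notes.txt</call>'
        return "Your notes say: " + transcript[-1].split(">", 1)[1].rsplit("<", 1)[0]

    print(agent_loop("What is in notes.txt?", fake_model,
                     {"read_file": lambda p: files.get(p, "not found")}))
```

Unknown tool names are reported back to the model as an error result instead of raising, so a malformed call becomes one more turn of the loop rather than a crash.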
You forgot the memory model, which is an absolutely essential and hard-to-design part of the agent.
and occasionally, UI prompts with QA.
Memory model? I would not want the agent to remember previous conversations.
Ever? You would not want to accumulate any useful context at all?
Maybe in the future, but with current models I've found constantly accessible memories to be an impediment. I don't want models to record and repeat mistakes or suboptimal strategies.
Gemini (just in the browser) has been really bad about conflating a bunch of similar projects. It remembers "oh, you have a home server that does XYZ", so my new home server that's doing ZYX instead must be the same system.
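One way to keep some of the benefit of memory while avoiding the conflation problem described above is to scope memories explicitly, for example per project, and let the user wipe a scope. This is a hypothetical sketch, not how Gemini or any other product stores memory. The `project` key and the keyword-overlap ranking are simplifying assumptions; real systems often use embedding search instead.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ScopedMemory:
    """Memories are stored per project, so facts about one project never leak into another."""
    _store: Dict[str, List[str]] = field(default_factory=lambda: defaultdict(list))

    def remember(self, project: str, fact: str) -> None:
        self._store[project].append(fact)

    def forget(self, project: str) -> None:
        # Lets the user wipe a project's memories, e.g. after a strategy proved wrong.
        self._store.pop(project, None)

    def recall(self, project: str, query: str, limit: int = 3) -> List[str]:
        """Return up to `limit` facts from this project only, ranked by words shared with the query."""
        words = set(query.lower().split())
        facts = self._store.get(project, [])
        scored = [(len(words & set(f.lower().split())), f) for f in facts]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return [f for _, f in ranked[:limit]]


if __name__ == "__main__":
    mem = ScopedMemory()
    mem.remember("old-server", "home server runs XYZ backups")
    mem.remember("new-server", "home server runs ZYX media")
    print(mem.recall("new-server", "what does the home server run"))
```

Because recall never crosses the project boundary, the new server's query only sees the new server's facts, even though both facts share the words "home server".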