Hacker News new | ask | show | jobs
by prerok 28 days ago
Well, there is also a big difference that it will not learn over time. If a junior makes a mistake and it will not be caught in time they will automatically learn.

With LLMs we have to teach them about their mistakes with adapting the harness and then hoping it will stick.

What I also find particularly hilarious about this whole thing is that we were always complaining about how difficult it is to put our tacit knowledge into words and therefore couldn't produce clear instructions for juniors to quickly ramp up. Now we are trying to do just that. I think we will find, just as we did in the past, that it's not possible. I do think a good harness improves results but LLMs will not be able to reach senior levels. Just my 2c.

5 comments

> Well, there is also a big difference that it will not learn over time.

My work is in tick-tock loop of learning - learn without modifying weights, demonstrate learnings to human, but then lock it back in (accumulate and spread).

This looks less like training and more like mentoring.

Getting a human to mentor an agent is a hard UX task, but the learning loop is not a technological problem anymore.

We can only get a tick once a week, no matter how many tocks we can do an hour.

Maybe someone knows, but it seems like the model used to be called the model, and the thing using a model (handling prompts and context and tool calling and feeding the model) used to be called the agent.

Are we now calling the model the agent and the agent the harness?

The nomenclature that makes sense for me is that the agent is the combination of the harness and the model. The model provides text-completion, the harness provides the loop around it, and the agent is the full structure of both.

However, nomenclature evolves over time. I recall (perhaps falsely) that The Cloud was specifically a term for elastic on-demand provider-managed compute/storage/network. Over time, it came to mean many other things. e.g. Salesforce Data Cloud.

I imagine if you step away from this for a year and come back, an agent will be something entirely different, perhaps a robotic horse, and a harness will be your saddle on the horse. Who knows?

The Cloud originally just meant servers on someone else's network; it came from flowchart diagrams in the 70s.
That’s basically how I always knew it. On a Visio diagram of your network, the thing on the other side of your router was literally a cloud.

So if someone asked where your CRM was, and you weren’t doing something local like Dynamics (…vomit), well that thing was “over here, in the cloud”.

I worked at a classic "cloud" providing company. We called "the fog". That was more descriptive of the seemingly non-deterministic nature of the overall system(s).
The harness isn't either of those; the harness is quite literally a harness, giving the model/agent sensors and actuators (aka "skills") to interact with its environment. Compare with e.g. the Power Loader from Aliens: https://www.deviantart.com/pynion/art/Aliens-Power-Loader-11...

The model is still the model, and the agent is still the user<->model interface.

Funny. harness = skills is one I hadn’t even heard yet.

But given the wide variety of mutually exclusive answers here, maybe you can get away with that.

Here's how I see it: "Agent" isn't really describing a component, it's describing how you use the LLM. You have the model, and you have a harness around it that might be minimal or might have more features. If it's directly responding to user actions then it's not an agent, if it's semi-autonomous then it's an agent. (Yes this line is sometimes fuzzy.)
There are new buzz words every two months. Remeber yesterday when everbody was throwing around RAG?
RAG died to better AIs. Turns out that a sufficiently advanced agentic model can do more than what RAG does with nothing but a grep tool over a pile of text files.
I think if the dream of semantic search from vector embeddings had worked out as well as people had hoped then "grep over a bunch of text" would have some significant disadvantages.

But in practice I never saw anyone crack the embedding-generation-and-comparison problems well enough to actually get better results than grep for things like "find similar code and see what it does."

(You also don't need that advanced a model to use "grep over a pile of files", but the models today can run MUCH faster than GPT 3.5/4 were running over the APIs back then, making "summarize all five hundred of these matches from those files" much more usable.)

I’ve had very good luck having my system search for available tool functions with natural language (ultimately against Qdrant). I’m surprised to hear that people are trying to grep files, instead.
People? No, that's what AI agents themselves do.

There are theoretical gains from using a vector search engine in an agentic loop, but grep is the lowest common denominator of agentic search.

Part of the positive aspect here is that if I have a junior dev who learns a lesson today, maybe they and their immediate peers learn it, but it won’t be all my junior devs and it certainly won’t be junior devs at other companies.

With models, there’s no reason that a model error in company A can’t be fixed for all of company A, and companies B-ZZZ.

Here's some reasons:

- The mistakes made aren't "model errors" typically; you can't point to some aspect of a model and say that was at fault.

- You can't submit a bug report to a model provider for a mistake made when using a model, and you can't* submit training data to be incorporated in the next release of the model.

- If you own your model and are training it yourself, other companies won't see a benefit.

- You probably need to fine-tune models for each specific role and context so you don't just diffuse all the learning; lessons learned won't be applied to all your junior dev models, but you don't want them all to learn something specific about product A.

- If you take this to its logical conclusion you will invent a new role of "model manager" and associated hierarchy to ensure that training is effective and timely, and that company-wide lessons are applied across the model fleet.

- This is all impractically expensive.

If it were practical to have LLMs learn as they go, that would be a bit of a shake-up, in much the same way that a house fire is a bit of a warm up.

* Well, everything you submit to a model provider is likely winding up in training data anyway, no matter what your contract says.

Why does company A want the model to get fixed for companies B-ZZZ?
Because they want the fixes that B-ZZZ learned about and they may not be able to avoid letting the model know that it made an error, unless they suddenly go silent to the model about what happened.
New job under AI. Go work for company A, but use it to write programs that use Company B's stack, but make sure to overcomplicate everything and "correct" the LLM into doing the wrong thing. Make sure Company B gets the results of your "improvements".
why would we let a competitor have the same advantages?

and getting an improvement to some random unrelated 3rd party give us...?

maybe -- and it's a big maybe -- their improvements could help us to. but that's not a given.

They learn between model iterations. You're right, it isn't the same thing as Junior developers' competence improving with experience - the current model's weaknesses are locked in. But it does mean that much of the Junior level thinking and mistakes will be outgrown by successor models.
But they don't retain anything from your on-the-job training. The next model iteration is yet another junior fresh out of college, and knows nothing about the painful training procedures its predecessor put you through.
Yes... but the next session with the same model is yet another junior fresh out of college that knows nothing about the painful lessons the last session put you through ten minutes ago, either.
Skill issue?

Nothing prevents an LLM agent from writing a bunch of "notes to self" and using that. And the next model from picking those notes up and using them. Coding agents already do some of that natively.

Hell, we might eventually get an LLM to say "wow the old AI was an incompetent idiot" after reviewing all the notes and session logs. That's how we know we reached human parity!

The context window limit prevents it, for one.
Only if you are incapable of fitting both the task and task-relevant data into it. And 1M contexts are mainstream by now.

Context size is a capacity limit, not a showstopper.

Surely you just copy the prompt over and it immediately knows all the same on the job stuff that the previous model did.
The point is the current model also knows nothing about the “on the job stuff”.

It’s extremely difficult(impossible?) to include every bit of relevant domain knowledge into “the prompt”

> If a junior makes a mistake and it will not be caught in time they will automatically learn.

I think this sentiment applies well to junior software engineers (with mentorship). But imagine the much larger swaths of entry level employees in operations, support, or sales functions. When you have a 400 person team with 20% annual turnover (since people move in / out of entry level jobs frequently), the management + training + monitoring becomes a huge challenge.

I think the typical HN sentiment of "llms aren't deterministic" fails to take into account how non-deterministic giant groups of people are. Every group of 10 people typically needs a manager. And every 10 managers needs another manager. By comparison the engineering work on dialing in your LLM guardrails feels pretty worthwhile.

Ya my experience is that many people honestly don't produce output as good as AI. An educated (formally or informally), experienced person who is putting forward good effort is better than AI, but I do know people who honestly just produce results having AI do it for them.