| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by skerit 112 days ago

> Then a brick hits you in the face when it dawns on you that all of our tools are dumping crazy amounts of non-relevant context into stdout thereby polluting your context windows.

I've found that letting the agent write its own optimized script for dealing with some things can really help with this. Claude is now forbidden from using `gradlew` directly, and can only use a helper script we made. It clears, recompiles, publishes locally, tests, ... all with a few extra flags. And when a test fails, the stack trace is printed.

Before this, Claude had to do A TON of different calls, all messing up the context. And when tests failed, it started to read gradle's generated HTML/XML files, which damaged the context immensely, since they contain a bunch of inline javascript.

And I've also been implementing this "LLM=true"-like behaviour in most of my applications. When an LLM is using it, logging is less verbose, it's also deduplicated so it doesn't show the same line a hundred times, ...

> He sees something goes wrong, but now he cut off the stacktraces by using tail, so he tries again using a bigger tail. Not satisfied with what he sees HE TRIES AGAIN with a bigger tail, and … you see the problem. It’s like a dog chasing its own tail.

I've had the same issue. Claude was running the 5+ minute test suite MULTIPLE TIMES in succession, just with a different `| grep something` tacked at the end. Now, the scripts I made always logs the entire (simplified) output, and just prints the path to the temporary file. This works so much better.

6 comments

majewsky 112 days ago

> Claude is now forbidden from using `gradlew` directly, and can only use a helper script we made. It clears, recompiles, publishes locally, tests, ... all with a few extra flags. And when a test fails, the stack trace is printed.

I think my question at this point is what about this is specific to LLMs. Humans should not be forced to wade through reams of garbage output either.

kimixa 112 days ago

Humans have the ability to ignore and generally not remember things after a short scan, prioritize what's actually important etc. But to an LLM a token is a token.

There's attempts at effectively doing something similar with analysis passes of the context - kinda what things like auto-compaction is doing - but I'm sure anyone who has used the current generation of those tools will tell you they're very much imperfect.

pennomi 112 days ago

The “a token is a token” effect makes LLMs really bad at some things humans are great at, and really good at some things humans are terrible at.

For example, I quickly get bored looking through long logfiles for anomalies but an LLM can highlight those super quickly.

dcrazy 112 days ago

Isn’t the purpose of self attention exactly to recognize the relevance of some tokens over others?

kimixa 112 days ago

That may help with tokens being "ignored" while still being in the context window, but not context window size costs and limitations in the first place.

rstuart4133 112 days ago

> I think my question at this point is what about this is specific to LLMs. Humans should not be forced to wade through reams of garbage output either.

Beware I'm a complete AI layman. All this is from background reading of popular articles. It may well be wrong. It's definitely out of date.

It has to do with how the attention heads work. The attention heads (the idea originated from the "Attention is all you need" paper, arguably the single most important AI paper to date), direct the LLM to work on the most relevant parts of the conversation. If you want a human analogue, it's your attention heads that are tacking the interesting points in a conversation.

The original attention heads output a relevance score for every pair of words in the context window. Thus in "Time flies like an arrow", it's the attention heads that spot the word "Time" is very relevant to "arrow", but not "flies". The implication of this is an attention head does O(N*N) work. It does not scale well to large context windows.

Nonetheless, you see claims of "large" context windows the LLMs marketing. (Large is in quotes, because even a 1M context window begins to feel very cramped in a write / test / fix loop.) But a 1M context-window would require a attention head requiring a 1 trillion element matrix. That isn't feasible. The industry even has a name for the size of the window they give in their marketing: the Effective Context Window. Internally they have another metric that measures the real amount of compute they throw at attention: the Physical Context Window. The bridge between the two is some proprietary magic that discards tokens in the context window that are likely to be irrelevant. In my experience, that bridge is pretty good at doing that, where "pretty good" is up to human standards.

But eventually (actually quickly in my experience), you fill up even the marketed size of the context window because it is remembering every word said, in the order they were said. If it reads code it's written to debug it, it appears twice in the context window. All compiler and test output also ends up there. Once the context window fills up they take drastic action, because it like letting malloc fail. Even reporting a malloc failure is hard because it usually needs more malloc to do the reporting. Anthropic calls it compacting. It throws away 90% of your tokens. It turns your helpful LLM into a goldfish with dementia. It is nowhere near as good as human is at remembering what happened. Not even close.

keeda 112 days ago

In my experience, it's the old time-invested vs time-saved trade off. If you're not looking at these reams of output often enough, the incentive to figure out all the flags and configs for verbosity to write these script is lower: https://xkcd.com/1205/

And because these issues are often sporadic, doing all this would be an unwanted sidequest, so humans grit their teeth and wade through the garbage manually each time.

With LLMs, the cost is effectively 0 compared to a human, so it doesn't matter. Have them write the script. In fact, because it benefits the LLM by reducing context pollution, which increases their accuracy, such measures should be actively identified and put in place.

adammarples 112 days ago

Lots of tools have a --quiet or --output json type option, which is usually helpful

ViktorEE 112 days ago

The way I've solved this issue with a long running build script is to have a logging scripts which redirects all outputs into a file and can be included with ``` # Redirect all output to a log file (re-execs script with redirection) source "$(dirname "$0")/common/logging.sh" ``` at the start of a script.

Then when the script runs the output is put into a file, and the LLM can search that. Works like a charm.

quintu5 112 days ago

This has been my exact experience with agents using gradle and it’s beyond frustrating to watch. I’ve been meaning to set up my own low-noise wrapper script.

This post just inspired me to tackle this once and for all today.

nrds 110 days ago

> This works so much better.

Pi coding agent does this by default with all outputs but Claude (all versions tested, including opus 4.6) just completely ignores this capability. Even when the tool output explicitly tells the agent that the full output is saved in a particular file, Claude reruns the command.

petedoyle 112 days ago

Wow, I'd love to do this. Any tips on how to build this (or how to help an LLM build this), specifically for ./gradlew?

esafak 112 days ago

How is it forbidden? I tell agents to use my wrappers in AGENTS but they ignore it half the time and use the naked tool.

Squid_Tamer 112 days ago

If you get desperate, I've given my agent a custom $PATH that replaces the forbidden tools with shims that either call the correct tool, or at least tell it what to do differently.

~/agent-shims/mvn:

    #!/bin/bash
    echo "Usage of 'mvn' is forbidden. Use build.sh or run-tests.sh"

That way it is prevented from using the wrong tools, and can self-correct when it tries.

simsla 112 days ago

Permissions scoping

esafak 112 days ago

Then they attempt to download the missing tool or write a substitute from scratch. Am I the only one who runs into this??