by jerezzprime 33 days ago
I'd be interested in seeing actual agent benchmarks (eg CC or Copilot CLI with grep removed and this tool instead).

For example, I have explored RTK and various LSP implementations and find that the models are so heavily RL'd with grep that they do not trust results in other forms and will continually retry or reread, and all token savings are lost because the model does not trust the results of the other tools.

9 comments

I just put something in my global CLAUDE.md (under ~/.claude) asking it to use the LSP instead of grep and have never had this issue since.
I've tried all sorts of tricks with Copilot CLI to get it to use LSP (project instructions, file extension instructions), but it just keeps forgetting. It says "yeah, you're right, I should have used it" but then doesn't.
My question would have been this: LSP solved this, no?
can you share that prompt?
Full output below. There's other stuff in there, the "working with unfamiliar data or systems" is maybe slightly risky but (seemingly, after a week or two) much more token efficient and effective.

I also added the plugins directly to Claude Code:

  ty Plugin · claude-code-lsps · enabled
  vscode-langservers Plugin · claude-code-lsps · enabled
  vtsls Plugin · claude-code-lsps · enabled

  ~ cat ~/.claude/CLAUDE.md 
# Python Environment
- ALWAYS use uv; never use pip, pip install, python, or python3 directly
- Activate venv: `source .venv/bin/activate`
- Install deps: `uv sync`
- Add a dep: `uv add <package>`
- Run scripts: `uv run <script.py>`
- Run tools: `uvx <tool>`

# Long-running scripts
- Any script, command, migration, data job, or test run that may take more than 2-3 seconds should emit regular status updates while it runs.
- Prefer progress that is useful for diagnosing where time is going: current phase, item counts, batch numbers, elapsed time, retry/backoff state, or the external service being waited on.
- For loops or batch jobs, log progress periodically rather than only at start/end; keep the cadence readable and avoid flooding output.

# Code Intelligence
- LSP servers available: ty (Python), vtsls (JS/TS), vscode-langservers (HTML/CSS/JSON)
- Use LSP for:
  - findReferences before any refactor
  - goToDefinition when navigating unfamiliar code
  - diagnostics after edits to catch type errors
- grep/search is fine for simple lookups in small files
- NEVER refactor without findReferences impact analysis first
- After every edit, check LSP diagnostics before moving on

# Documentation
- Context7 is available for up-to-date library docs
- Use `ctx7 docs <libraryId> <query>` to fetch current documentation
- Use `ctx7 library <name> <query>` to find a library ID first

# Working with unfamiliar data or systems
- Prefer experimenting on real data over reasoning about it in the abstract. Your outputs are noticeably better when grounded in a concrete sample than when derived from minutes of speculation.
- When a task involves parsing/processing/integrating with some external artifact (a report, an API response, a file format, a third-party tool's output), the FIRST step is to fetch or generate a real example and inspect it. Do not write code against an imagined shape.
- Experiments must be non-destructive: read-only fetches, copies into a scratch dir, dry-run flags. Never mutate the user's real data to learn about it.
- Before assuming you lack credentials, check the current working directory's `.env` file (and `.env.example` for hints about which keys exist): API keys, tokens, and connection strings for the relevant service are very often already there.
- If you cannot obtain real data on your own (auth genuinely missing, lives on another machine, behind a paywall, etc.), STOP and ask the user to provide a sample rather than guessing.
- Example: asked to process an Amazon sales report, the first action is to fetch (or have the user paste) one actual report and look at its columns, not to draft a parser based on what such a report "probably" contains.

Codex CLI is quite happy running RTK. Well, with GPT 5.5 xhigh anyway.

One thing that irks me is that when it doesn't support e.g. a CLI flag of find, it gives an error message rather than falling back to the full output of the command. Then the agent wastes tokens retrying, or worse, doesn't even try because the prompting may make it afraid to run commands without rtk.

how effective is RTK for you? worth using?
I found judicious use of rtk on specific commands that you know can be improved with rtk (e.g. go test, pnpm test (vitest), etc.) to be worthwhile, at least in CC. But using their default setup, which is to prepend rtk to everything, is more trouble than it's worth. I have a custom-built hook that prepends rtk based on a hierarchical whitelist.
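The hook itself isn't shown in the thread, so here is only a rough sketch of what "prepend rtk based on a hierarchical whitelist" could look like. The `WHITELIST` layout and the `should_wrap` and `rewrite` names are hypothetical; the only things taken from the comment are the two example commands (go test, pnpm test) and the idea of prefixing the command with `rtk`. How the function gets wired into a particular agent's hook system is left out.

```python
import shlex

# Hypothetical whitelist: nested dicts are command prefixes, True means
# "wrap this command (and anything below it) with rtk".
WHITELIST = {
    "go": {"test": True},
    "pnpm": {"test": True},
}


def should_wrap(argv, tree=WHITELIST):
    """Walk the whitelist one token at a time; True means prepend rtk."""
    node = tree
    for token in argv:
        if node is True:
            return True  # a whitelisted prefix already matched
        if not isinstance(node, dict) or token not in node:
            return False
        node = node[token]
    return node is True


def rewrite(command):
    """Return the command with `rtk ` prepended when it is whitelisted."""
    try:
        argv = shlex.split(command)
    except ValueError:
        return command  # unbalanced quotes: leave the command untouched
    if not argv or argv[0] == "rtk":
        return command  # empty or already wrapped
    return f"rtk {command}" if should_wrap(argv) else command
```

With this layout, `rewrite("go test ./...")` gets the prefix while `rewrite("go build ./...")` and anything not listed pass through unchanged, which is the opposite of the prepend-to-everything default.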

And you should disable the savings reporting feature since it’s worse than useless—it breaks sandboxing and always reports ~100% savings for me because rtk obviously doesn’t know about the head/tail the agent pipes into.

I can't find the relevant issues in their repo, but I've been somewhat skeptical of their tool over-reporting token savings and there are many issues to that effect in the repo.

I'm not likely to install it again in my latest configuration, instead applying some specific tricks to things like `make test` so they spit out zero output except on unsuccessful exit codes, that sort of thing. Anecdotally, I see GPT-5.5 often automatically applying context-limiting flags to the bash it writes :shrug:
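For anyone wanting to try the "quiet unless it fails" trick, a minimal sketch is below. It assumes the trick means: capture everything, show nothing on exit code 0, and show the full output otherwise. The `run_quiet` and `main` names are made up for illustration and this is not the commenter's actual setup.

```python
import subprocess
import sys


def run_quiet(argv):
    """Run argv; return (exit_code, text_to_show). Text is empty on success."""
    proc = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into the captured stdout
        text=True,
    )
    if proc.returncode == 0:
        return 0, ""  # success: the agent sees nothing
    return proc.returncode, proc.stdout  # failure: show everything


def main(argv):
    """Print only what run_quiet decides to show; return the exit code."""
    code, shown = run_quiet(argv)
    sys.stdout.write(shown)
    return code
```

Calling `main(["make", "test"])` from a small wrapper script would then print nothing when the tests pass and the whole log when they fail, while still handing back the original exit code.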

I've had the same experience with RTK, where my agent got stuck in a loop with a faulty RTK command and could not escape it since RTK hard overwrites anything automatically. I've uninstalled it again for the time being.
I had better results with lean ctx and context mode than with rtk.
Wondering too
Yeah we're also interested in doing this, it's on the roadmap together with optimization of the prompt and descriptions so that models have an easier time using it.

Perhaps anecdotally: we do use this tool ourselves of course, and it's been working pretty well so far. Anthropic models call it and seem to trust the results.

I forced Claude to have a global memory for RTK and my own AI memory system (GuardRails), and it happily uses both. The only time it doesn't use GuardRails is if I don't mention it at all; otherwise it always uses RTK unless RTK falls apart running a tool it does not support.
Token savings are more and more important, but it also matters whether the agent trusts the result and stops searching. A benchmark should measure the full agent loop instead of just the search output.
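To make the difference concrete, here is a small sketch of per-call versus full-loop accounting. The trace format (a list of `(tool, tokens)` pairs) and the function names are assumptions for illustration, and the numbers in the usage note are invented, not measurements.

```python
def loop_tokens(calls):
    """Total tool-output tokens consumed to finish one task.

    calls: list of (tool_name, output_tokens) pairs, one per tool call.
    """
    return sum(tokens for _tool, tokens in calls)


def savings(baseline, candidate):
    """Fractional savings of candidate vs. baseline over the full loop.

    Negative values mean the candidate used more tokens overall.
    """
    base = loop_tokens(baseline)
    return (base - loop_tokens(candidate)) / base
```

With made-up numbers: if the baseline is a single grep returning 1000 tokens, and the candidate run is a 200-token search followed by a distrustful 1000-token grep and a 500-token file reread, the first call alone looks like an 80% saving, but `savings(...)` over the whole loop comes out at -0.7, i.e. 70% more tokens.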
>so heavily RL'd with grep

At least codex listens to me telling it to use rg instead of grep, cause grep is often so slow. But when adding rtk it uses grep through rtk which is kind of annoying.

I think the best bet is to use some kind of proxy so that when the model calls grep, you intercept the call, use another tool to search, and give the results back to the model.
True. Just have an interface that behaves like grep, with output formatted as grep's would be, but internally: indexed, ranked, ...

So the model trusts the output because it is grep :D
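A toy version of that idea, as a sketch only: the class below answers whole-word queries from a prebuilt index and formats hits as `path:line:text`, roughly what `grep -rnw` prints for a literal word. The `GrepFacade` name and the in-memory `{path: text}` corpus are hypothetical, there is no regex or flag support, and the interception of the model's actual grep call is not shown.

```python
import re
from collections import defaultdict


class GrepFacade:
    """Answers literal-word queries from an index, formatted like grep."""

    def __init__(self, files):
        # files: {path: text}; a hypothetical in-memory corpus for the sketch.
        self.lines = {path: text.splitlines() for path, text in files.items()}
        self.index = defaultdict(set)  # word -> {(path, line_number)}
        for path, lines in self.lines.items():
            for number, line in enumerate(lines, start=1):
                for word in re.findall(r"\w+", line):
                    self.index[word].add((path, number))

    def grep(self, word):
        """Return matches as 'path:line:text', sorted by path then line."""
        hits = sorted(self.index.get(word, ()))
        return "\n".join(
            f"{path}:{number}:{self.lines[path][number - 1]}"
            for path, number in hits
        )
```

The lookup never scans the files at query time, yet the caller only ever sees grep-shaped lines, which is the whole point of the trick.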

I tried to use rtk once and the model got stuck just running and rerunning and rerunning in a loop, until I killed it. I have no idea what happened.
Hey, this is something we're actively working on, but this is hard (and expensive) to do well across harnesses/models. The grep pretraining thing is very interesting though, I've noticed the same. E.g. Sonnet 4.6 seems to trust semble but Opus 4.7 less so. I'm hoping we can quantitatively test this and improve it when we have proper benchmarks for this as well. If you do have any feedback though let me know!