HN Mirror

	by jerezzprime 33 days ago
	I'd be interested in seeing actual agent benchmarks (eg CC or Copilot CLI with grep removed and this tool instead). For example, I have explored RTK and various LSP implementations and find that the models are so heavily RL'd with grep that they do not trust results in other forms and will continually retry or reread, and all token savings are lost because the model does not trust the results of the other tools.

9 comments

AussieWog93 33 days ago

I just put something in my global CLAUDE.md (under ~/.claude) asking it to use the LSP instead of grep and have never had this issue since.

turbine401 33 days ago

I've tried all sorts of tricks with Copilot CLI to get it to use LSP: project instructions, file-extension instructions, but it just keeps forgetting. It says "yeah, you're right, I should have used it" but then doesn't.

gigatexal 33 days ago

My question would have been this. LSP solved this, no?

yakbarber 33 days ago

can you share that prompt?

AussieWog93 33 days ago

Full output below. There's other stuff in there; the "Working with unfamiliar data or systems" section is maybe slightly risky but (seemingly, after a week or two) much more token efficient and effective.

I also added the plugins directly to Claude Code:

  ty Plugin · claude-code-lsps · enabled
  vscode-langservers Plugin · claude-code-lsps · enabled
  vtsls Plugin · claude-code-lsps · enabled

  ~ cat ~/.claude/CLAUDE.md

# Python Environment
- ALWAYS use uv — never use pip, pip install, python, or python3 directly
- Activate venv: `source .venv/bin/activate`
- Install deps: `uv sync`
- Add a dep: `uv add <package>`
- Run scripts: `uv run <script.py>`
- Run tools: `uvx <tool>`

# Long-running scripts
- Any script, command, migration, data job, or test run that may take more than 2-3 seconds should emit regular status updates while it runs.
- Prefer progress that is useful for diagnosing where time is going: current phase, item counts, batch numbers, elapsed time, retry/backoff state, or the external service being waited on.
- For loops or batch jobs, log progress periodically rather than only at start/end; keep the cadence readable and avoid flooding output.

# Code Intelligence
- LSP servers available: ty (Python), vtsls (JS/TS), vscode-langservers (HTML/CSS/JSON)
- Use LSP for:
  - findReferences before any refactor
  - goToDefinition when navigating unfamiliar code
  - diagnostics after edits to catch type errors
- grep/search is fine for simple lookups in small files
- NEVER refactor without findReferences impact analysis first
- After every edit, check LSP diagnostics before moving on

# Documentation
- Context7 is available for up-to-date library docs
- Use `ctx7 docs <libraryId> <query>` to fetch current documentation
- Use `ctx7 library <name> <query>` to find a library ID first

# Working with unfamiliar data or systems
- Prefer experimenting on real data over reasoning about it in the abstract. Your outputs are noticeably better when grounded in a concrete sample than when derived from minutes of speculation.
- When a task involves parsing/processing/integrating with some external artifact (a report, an API response, a file format, a third-party tool's output), the FIRST step is to fetch or generate a real example and inspect it. Do not write code against an imagined shape.
- Experiments must be non-destructive: read-only fetches, copies into a scratch dir, dry-run flags. Never mutate the user's real data to learn about it.
- Before assuming you lack credentials, check the current working directory's `.env` file (and `.env.example` for hints about which keys exist) — API keys, tokens, and connection strings for the relevant service are very often already there.
- If you cannot obtain real data on your own (auth genuinely missing, lives on another machine, behind a paywall, etc.), STOP and ask the user to provide a sample rather than guessing.
- Example: asked to process an Amazon sales report, the first action is to fetch (or have the user paste) one actual report and look at its columns — not to draft a parser based on what such a report "probably" contains.
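
To make the "Long-running scripts" rule in the CLAUDE.md above concrete, here is a minimal Python sketch of periodic progress logging for a batch job. The function name `process_in_batches` and its parameters (`every`, `log`, `clock`) are made up for illustration and are not part of the original config.

```python
import time


def process_in_batches(items, batch_size, handle, log=print, every=1,
                       clock=time.monotonic):
    """Process items in batches and log progress every `every` batches.

    Hypothetical helper: `log` and `clock` are injectable so the cadence
    can be checked without real output or real waiting.
    """
    start = clock()
    total = len(items)
    done = 0
    for batch_no, i in enumerate(range(0, total, batch_size), start=1):
        batch = items[i:i + batch_size]
        for item in batch:
            handle(item)
        done += len(batch)
        # Log periodically, and always on the final batch.
        if batch_no % every == 0 or done == total:
            elapsed = clock() - start
            log(f"batch {batch_no}: {done}/{total} items, {elapsed:.1f}s elapsed")
    return done
```

Raising `every` keeps the output readable for large jobs while still reporting item counts and elapsed time.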

nextaccountic 33 days ago

Codex CLI is quite happy running RTK. Well with GPT 5.5 xhigh anyway

One thing that irks me is that when it doesn't support, e.g., a CLI flag of find, it gives an error message rather than passing through the full output of the command. Then the agent wastes tokens retrying, or worse, doesn't even try, because the prompting may make it afraid to run commands without rtk.
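
The fallback behavior wished for above could look roughly like this. It is a generic sketch, not how rtk works: `run_with_fallback` is a hypothetical helper that takes two argv lists, and it treats any non-zero exit of the wrapped command as "wrapper failed". A real version would need to tell a wrapper error apart from the underlying command legitimately failing, and should not rerun commands that have side effects.

```python
import subprocess
import sys


def run_with_fallback(wrapped_argv, raw_argv):
    """Run the wrapped command; on a non-zero exit, rerun the raw one.

    Returns (stdout, used_fallback). Assumption: a non-zero exit from the
    wrapped command means the wrapper rejected the invocation.
    """
    wrapped = subprocess.run(wrapped_argv, capture_output=True, text=True)
    if wrapped.returncode == 0:
        return wrapped.stdout, False
    # Hand back the unfiltered output instead of an error message.
    print("wrapper failed, falling back to raw command", file=sys.stderr)
    raw = subprocess.run(raw_argv, capture_output=True, text=True)
    return raw.stdout, True
```

The agent then always gets usable output in one call, at the cost of losing the token savings for that one command.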

aleksiy123 33 days ago

how effective is RTK for you? worth using?

oefrha 33 days ago

I found judicious use of rtk on specific commands that you know can be improved with rtk, e.g. go test, pnpm test (vitest), etc., to be worthwhile, at least in CC. But using their default setup, which is to prepend rtk to everything, is more trouble than it's worth. I have a custom-built hook that prepends rtk based on a hierarchical whitelist.

And you should disable the savings reporting feature since it’s worse than useless—it breaks sandboxing and always reports ~100% savings for me because rtk obviously doesn’t know about the head/tail the agent pipes into.
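
The commenter's hook is not shown, so the following is only a guess at what "prepend rtk based on a hierarchical whitelist" could mean: a nested dict keyed by command tokens, where `True` marks a prefix whose subcommands may all be wrapped. `WHITELIST`, `is_whitelisted`, and `maybe_prefix` are hypothetical names, and wiring this into an agent's hook system is left out.

```python
import shlex

# Hypothetical whitelist: nested dicts keyed by command tokens.
# True means "everything under this prefix may be wrapped".
WHITELIST = {
    "go": {"test": True},
    "pnpm": {"test": True},
}


def is_whitelisted(tokens, tree=WHITELIST):
    """Walk the token list down the tree; succeed once a True leaf is hit."""
    node = tree
    for token in tokens:
        if node is True:
            return True
        if not isinstance(node, dict) or token not in node:
            return False
        node = node[token]
    return node is True


def maybe_prefix(command, prefix="rtk"):
    """Return the command with the prefix prepended only if whitelisted."""
    tokens = shlex.split(command)
    if tokens and tokens[0] != prefix and is_whitelisted(tokens):
        return f"{prefix} {command}"
    return command
```

With this table, `go test ./...` gets wrapped while `go build ./...` and `ls -la` pass through untouched. Note that `shlex.split` raises `ValueError` on unbalanced quotes, which a real hook would have to handle.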

philipbjorge 33 days ago

I can't find the relevant issues in their repo, but I've been somewhat skeptical of their tool over-reporting token savings and there are many issues to that effect in the repo.

I'm not likely to install it again in my latest configuration; instead I'm applying some specific tricks to things like `make test` so that it spits out zero output unless it exits with an unsuccessful exit code, that sort of thing. Anecdotally, I see GPT-5.5 often automatically applying context-limiting flags to the bash it writes :shrug:
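
One way to get the "zero output unless it fails" behavior is a small wrapper around the test command. This is a sketch under assumptions, not the commenter's actual setup: `run_quietly` and `main` are hypothetical names, and stderr is merged into stdout so the failure log stays in order.

```python
import subprocess
import sys


def run_quietly(argv):
    """Run argv; return (exit_code, text), where text is empty on success."""
    result = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr with stdout
        text=True,
    )
    if result.returncode == 0:
        return 0, ""
    return result.returncode, result.stdout


def main(argv=None):
    """Command-line entry point: print captured output only on failure."""
    argv = sys.argv[1:] if argv is None else argv
    code, text = run_quietly(argv)
    sys.stdout.write(text)
    return code
```

Calling `main(["make", "test"])` would then print nothing on a green run and the full captured log on a red one, passing the exit code through so the agent can still tell the two apart.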

Bibabomas 33 days ago

I've had the same experience with RTK, where my agent got stuck in a loop with a faulty RTK command and could not escape it since RTK hard overwrites anything automatically. I've uninstalled it again for the time being.

DeathArrow 33 days ago

I had better results with lean ctx and context mode than with rtk.

maille 33 days ago

Wondering too

stephantul 33 days ago

Yeah we're also interested in doing this, it's on the roadmap together with optimization of the prompt and descriptions so that models have an easier time using it.

Perhaps anecdotally: we do use this tool ourselves of course, and it's been working pretty well so far. Anthropic models call it and seem to trust the results.

giancarlostoro 33 days ago

I forced Claude to have a global memory for RTK and my own AI memory system (GuardRails), and it happily uses both. The only time it doesn't use GuardRails is if I don't mention it at all; otherwise it always uses RTK unless RTK falls apart running a tool it does not support.

Riany 33 days ago

Token savings are more and more important, but it's also important whether the agent trusts the result and stops searching. A benchmark should measure the full agent loop instead of just the search output.
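
As a rough illustration of measuring the whole loop rather than a single search result, the sketch below sums tokens over every tool call in a trajectory. The `ToolCall` record and both function names are hypothetical, and the numbers in the usage note are invented purely to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str
    input_tokens: int
    output_tokens: int


def loop_tokens(calls):
    """Total tokens spent across every tool call in one agent loop."""
    return sum(c.input_tokens + c.output_tokens for c in calls)


def net_savings(baseline_calls, candidate_calls):
    """Positive when the candidate loop is cheaper than the baseline."""
    return loop_tokens(baseline_calls) - loop_tokens(candidate_calls)
```

For example, a baseline of one grep call (50 in, 2000 out) costs 2050 tokens. If the new search tool returns only 400 tokens (50 in, 400 out) but the model distrusts it and reruns the same grep, the loop costs 2500 tokens, so `net_savings` is -450 even though the search output alone looked 80% smaller.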

carlmr 33 days ago

>so heavily RL'd with grep

At least codex listens to me telling it to use rg instead of grep, cause grep is often so slow. But when adding rtk it uses grep through rtk which is kind of annoying.

DeathArrow 33 days ago

I think the best bet is to use some kind of proxy: when the model calls grep, you intercept the call, use another tool to search, and give the results back to the model.

economyballoon 31 days ago

True. Just have an interface that behaves like grep, with output in the format expected from grep, but internally: indexed, ranked, ...

So the model trusts the output because it is grep :D
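
A toy version of the "looks like grep, indexed underneath" idea is sketched below. It assumes whole-word lookups only and formats hits as `path:line:text`, the shape `grep -n` prints when searching multiple files. `GrepShapedIndex` is a hypothetical class; intercepting the model's real grep calls, honoring grep's flags and regex syntax, and ranking are all left out.

```python
from collections import defaultdict


class GrepShapedIndex:
    """Toy inverted index that answers queries in `grep -n` style lines."""

    def __init__(self):
        self._lines = {}                    # (path, lineno) -> line text
        self._postings = defaultdict(set)   # word -> {(path, lineno)}

    def add_file(self, path, text):
        """Index every whitespace-separated word of every line."""
        for lineno, line in enumerate(text.splitlines(), start=1):
            self._lines[(path, lineno)] = line
            for word in line.split():
                self._postings[word].add((path, lineno))

    def search(self, word):
        """Return hits as 'path:lineno:text', ordered by path then line."""
        hits = sorted(self._postings.get(word, ()))
        return [
            f"{path}:{lineno}:{self._lines[(path, lineno)]}"
            for path, lineno in hits
        ]
```

Because the lookup is a dictionary access rather than a scan, the speed comes from the index, while the model only ever sees output in the shape it already expects.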

stavros 32 days ago

I tried to use rtk once and the model got stuck just running and rerunning and rerunning in a loop, until I killed it. I have no idea what happened.

Bibabomas 33 days ago

Hey, this is something we're actively working on, but this is hard (and expensive) to do well across harnesses/models. The grep pretraining thing is very interesting though, I've noticed the same. E.g. Sonnet 4.6 seems to trust semble but Opus 4.7 less so. I'm hoping we can quantitatively test this and improve it when we have proper benchmarks for this as well. If you do have any feedback though let me know!
