| HN Mirror


by mark_undoio 346 days ago

I'm fascinated by this paper because it feels like it could be a good analogue for "can LLMs handle a stateful, text-based tool". A debugger is my particular interest but there's no reason why it couldn't be something else.

To use a debugger, you need:

* Some memory of where you've already explored in the code (vs rooms in a dungeon)

* Some wider idea of your current goal / destination (vs a current quest or a treasure)

* A plan for how to get there - but the flexibility to adapt (vs expected path and potential monsters / dead ends)

* A way for managing information you've learned / state you've viewed (vs inventory)
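
To make the four items above concrete, here is a minimal sketch of the bookkeeping an agent might keep while driving a debugger. The class and field names (`DebugSession`, `goal`, `plan`, `visited`, `notes`) are hypothetical and chosen only for illustration; they are not taken from any existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class DebugSession:
    """Hypothetical state for an agent exploring a program with a debugger."""

    goal: str                                            # the current "quest"
    plan: list[str] = field(default_factory=list)        # expected path, open to revision
    visited: set[str] = field(default_factory=set)       # code locations seen ("rooms")
    notes: dict[str, str] = field(default_factory=dict)  # observed state ("inventory")

    def visit(self, location: str) -> bool:
        """Record a location; return True if it had not been explored before."""
        is_new = location not in self.visited
        self.visited.add(location)
        return is_new

    def learn(self, name: str, value: str) -> None:
        """Store a value that was inspected, e.g. a variable's contents."""
        self.notes[name] = value

    def replan(self, steps: list[str]) -> None:
        """Replace the plan when a dead end forces a change of route."""
        self.plan = list(steps)
```

The `visit` return value is the interesting bit: it lets the agent notice when it is walking in circles, which is the debugger equivalent of re-entering a room it has already mapped.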

Given text adventures are quite well-documented and there are many of them out there, I'd also like to take time out to experiment (at some point!) with whether presenting a command-line tool as a text adventure might be a useful "API".

e.g. an MCP server that exposes a tool but also provides a mapping of the tool's concepts into dungeon adventure concepts (and back). If nothing else, the LLM's reasoning should be pretty entertaining. Maybe playing "make believe" will even make it better at some things - that would be very cool.
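
As a rough sketch of what that two-way mapping could look like, here is a plain Python translation table between adventure phrases and GDB commands. The adventure vocabulary and the function names `to_debugger` / `to_adventure` are made up for illustration, and the MCP server wiring is deliberately left out; only the GDB commands themselves (`step`, `next`, `finish`, `info locals`, `backtrace`, `print`) are real.

```python
# Hypothetical vocabulary: adventure phrase -> GDB command.
ADVENTURE_TO_GDB = {
    "enter": "step",               # go into the function call on this line
    "walk on": "next",             # step over the current line
    "leave room": "finish",        # run until the current function returns
    "look around": "info locals",  # list local variables
    "check map": "backtrace",      # show the call stack
}
GDB_TO_ADVENTURE = {gdb: phrase for phrase, gdb in ADVENTURE_TO_GDB.items()}


def to_debugger(phrase: str) -> str:
    """Translate an adventure phrase into a GDB command."""
    text = phrase.strip()
    # "examine <thing>" carries an argument, so handle it separately
    # and keep the argument's original case (variable names matter).
    if text.lower().startswith("examine "):
        return "print " + text[len("examine "):].strip()
    try:
        return ADVENTURE_TO_GDB[text.lower()]
    except KeyError:
        raise ValueError(f"no debugger command for {phrase!r}") from None


def to_adventure(command: str) -> str:
    """Translate a GDB command back into an adventure phrase."""
    text = command.strip()
    if text.startswith("print "):
        return "examine " + text[len("print "):].strip()
    try:
        return GDB_TO_ADVENTURE[text]
    except KeyError:
        raise ValueError(f"no adventure phrase for {command!r}") from None
```

Keeping the table invertible means the debugger's output side can be narrated in the same vocabulary the model used to issue the command.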

2 comments

alwa 346 days ago

That’s a delightful concept to think about! I’m not sure what conceptual information the translation layer would add to the LLM’s internal representation of the state space.

But the broader concept of asking it to translate something structurally to a different domain, then seeing how the norms of that domain cause it to manipulate the state differently… that tickles my fancy for sure. Like you said, it sounds cool even in an art-project sense just to read what it says!


vladimirralev 346 days ago

I've seen both Replit and Cline agents iteratively debug hard problems with massive amounts of log lines. They can do it already.


mark_undoio 346 days ago

That's the thing though - they're using logs. My theory is that LLMs are intrinsically quite good at that because they're good at sifting text.

Getting them to drive something like a debugger interface seems harder from my experience (although the ChatDBG people showed some success - my experiments did too, but it took the tweaks I described).

My experiments are with Claude Opus 4, in Claude Code, primarily.


throwaway81523 346 days ago

Look also at Delta Debugging, which didn't need an LLM.
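
For anyone who hasn't seen it, here is a minimal sketch of the core idea: repeatedly split a failing input into chunks and keep any chunk (or any complement of a chunk) that still fails. This is a simplified variant written for illustration, not the exact published ddmin algorithm; the function name `ddmin` and the `fails` predicate are assumptions of this sketch.

```python
def ddmin(items, fails):
    """Shrink `items` to a smaller list for which fails(items) is still True.

    `fails` is a caller-supplied predicate that runs the test on a candidate
    input and returns True when the failure reproduces.
    """
    if not fails(items):
        raise ValueError("the full input must fail to begin with")
    n = 2  # current number of chunks (granularity)
    while len(items) >= 2:
        chunk = max(len(items) // n, 1)
        subsets = [items[i:i + chunk] for i in range(0, len(items), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            # Everything except the i-th chunk.
            complement = [x for j, s in enumerate(subsets) if j != i for x in s]
            if fails(subset):
                # One chunk alone reproduces the failure: restart on it.
                items, n, reduced = subset, 2, True
                break
            if fails(complement):
                # The chunk was irrelevant: drop it and continue.
                items, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(items):
                break  # already at single-element granularity
            n = min(n * 2, len(items))  # try finer chunks
    return items
```

For example, with `fails = lambda xs: 3 in xs and 7 in xs`, calling `ddmin(list(range(10)), fails)` narrows the ten-element input down to the two elements that matter.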
