Hacker News
by haeseong 10 days ago
The deeper reason agents write good Bonsai_term code is that the entire UI renders as plain text, so a screenshot test is just a diff the model can read and verify on its own. A GUI's visual state needs a vision model to inspect, but a TUI's output already lives in the agent's native modality, which closes the feedback loop for free.
2 comments

This isn't really much of an excuse given contemporary models, though. My current game project has a GUI editor mode, and it was not at all difficult to set things up so that whenever I run a debug build of the game:

- It opens to the editor mode rather than the gameplay mode on launch

- It makes a .run/ directory next to the executable if one doesn't already exist

- It makes a timestamped directory within .run/ for this current debug run

- It automatically records stdout to stdout.txt, stderr to stderr.txt, and a crash.txt if the game crashes, in the directory for this run

- When the “take debug screenshot” function is invoked (which can be done by pressing F12), it saves a timestamped (based on time since executable launched) screenshot in the directory for this run

- Editor actions and 3D camera movements are recorded to playback.txt in the directory for this run
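For anyone who wants to try the same thing, here is a rough sketch of the per-run directory layout in Python. It is not the actual game code: the `DebugRun` class, its method names, the directory timestamp format, the screenshot file naming, and the one-action-per-line format of playback.txt are all assumptions made for illustration. Only the file names (.run/, stdout.txt, stderr.txt, crash.txt, playback.txt) come from the list above.

```python
import time
from datetime import datetime
from pathlib import Path


class DebugRun:
    """Hypothetical per-run output directory for a debug build."""

    def __init__(self, exe_dir: Path):
        # Monotonic clock, so "time since launch" never jumps backwards.
        self.started = time.monotonic()
        # .run/ next to the executable, one timestamped directory per run.
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        self.dir = exe_dir / ".run" / stamp
        self.dir.mkdir(parents=True, exist_ok=True)
        # The game would redirect its own output streams into these files.
        self.stdout = open(self.dir / "stdout.txt", "w", encoding="utf-8")
        self.stderr = open(self.dir / "stderr.txt", "w", encoding="utf-8")
        self.playback = open(self.dir / "playback.txt", "w", encoding="utf-8")

    def elapsed_ms(self) -> int:
        """Milliseconds since this run started."""
        return int((time.monotonic() - self.started) * 1000)

    def screenshot_path(self) -> Path:
        """Where the next debug screenshot (e.g. on F12) would be saved."""
        return self.dir / f"screenshot-{self.elapsed_ms():08d}.png"

    def record_action(self, action: str) -> None:
        """Append one editor action or camera move (assumed line format)."""
        self.playback.write(f"{self.elapsed_ms()} {action}\n")
        self.playback.flush()  # keep the file usable even if the game crashes

    def record_crash(self, details: str) -> None:
        """Write crash.txt; only called when the game actually crashes."""
        (self.dir / "crash.txt").write_text(details, encoding="utf-8")

    def close(self) -> None:
        for stream in (self.stdout, self.stderr, self.playback):
            stream.close()
```

The idea is to create one `DebugRun` at startup, call `record_action` from the editor's input handling, and hand `screenshot_path()` to whatever actually captures the frame.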

With all of this in place, I can do a debug build, run the game, do something in the editor, and take one or more screenshots where things went wrong. Then, Codex can see the log files and screenshots and try to diagnose the problem. When attempting to fix the problem, it can automatically recompile the debug build and rerun it with a launch option that plays back the latest recording file, which does the same sequence of editor actions/camera movements and takes screenshots at the same points in the process. Then it can compare this to the initial recorded run and see what needs to be fixed.
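The playback side can be sketched the same way. This is a minimal illustration, not the real implementation: it assumes playback.txt holds one `<elapsed_ms> <action>` entry per line, and `load_playback`, `replay`, and the `dispatch` callback are hypothetical names. The point is only that screenshot requests are recorded alongside the editor actions, so a replay takes them at the same points in the sequence.

```python
from pathlib import Path
from typing import Callable, List, Tuple

# (milliseconds since launch, action text) -- an assumed record shape.
Event = Tuple[int, str]


def load_playback(path: Path) -> List[Event]:
    """Parse a recording, skipping blank lines."""
    events: List[Event] = []
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        ms, _, action = line.partition(" ")
        events.append((int(ms), action))
    return events


def replay(events: List[Event], dispatch: Callable[[int, str], None]) -> None:
    """Feed events to the editor in timestamp order.

    `dispatch` stands in for whatever applies an action to the editor,
    including the action that takes a debug screenshot.
    """
    for ms, action in sorted(events, key=lambda event: event[0]):
        dispatch(ms, action)
```

Because the sort is stable, actions recorded at the same millisecond keep their original order.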

We could be having a GUI renaissance right now, but for various, primarily aesthetic, reasons people are churning out TUIs, and personally I think it's a huge mistake.

For snapshot tests it seems better to diff a data representation, such as a yaml string, than to diff UIs.
The whole UI seems better for LLMs to consume and also displays nicely in-editor for humans. Test failures essentially become failing screenshot tests, which are really comfortable changes to review.
"displays nicely in-editor" is the whole point of yaml. The snapshots in the article are just yaml with additional useless ASCII characters
Yaml is just a serialization of some intermediate representation. The expect test output in the article is a full-fidelity (minus color) rendering of the UI. I would argue that the final app UI is both easier for humans to read and covers more code. As an example, a naive yaml test would likely not capture the positions of all the elements in the app, so you’d be able to silently introduce positioning bugs. On the flip side, if the yaml does include positioning information, then it’s now substantially harder to read than the UI test, and the signal from the test is compromised because readers will have a harder time understanding it and be more likely to ignore it.
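A toy example of the positioning point, in Python. Nothing here is Bonsai_term or the article's test setup: `Label`, `naive_snapshot`, and `render` are made-up names, and the "yaml" is just hand-formatted text standing in for a data-only snapshot that omits positions.

```python
import difflib
from dataclasses import dataclass
from typing import List


@dataclass
class Label:
    text: str
    row: int
    col: int


def naive_snapshot(labels: List[Label]) -> str:
    """Data-only snapshot: records contents but not positions."""
    return "\n".join(f"- label: {label.text}" for label in labels)


def render(labels: List[Label], width: int = 20, height: int = 3) -> str:
    """Draw the labels onto a character grid, later labels on top."""
    grid = [[" "] * width for _ in range(height)]
    for label in labels:
        for i, ch in enumerate(label.text):
            if 0 <= label.row < height and 0 <= label.col + i < width:
                grid[label.row][label.col + i] = ch
    return "\n".join("".join(row).rstrip() for row in grid)


before = [Label("OK", 1, 2), Label("Cancel", 1, 8)]
# Positioning bug: "Cancel" now starts on top of "OK".
after = [Label("OK", 1, 2), Label("Cancel", 1, 3)]

# The data-only snapshot cannot tell the two layouts apart...
print(naive_snapshot(before) == naive_snapshot(after))

# ...while the rendered snapshot shows the overlap as a one-line diff.
for line in difflib.unified_diff(
    render(before).splitlines(), render(after).splitlines(), lineterm=""
):
    print(line)
```

Adding `row` and `col` to the data snapshot would catch the bug too, but then the reviewer has to reconstruct the overlap from numbers instead of seeing it.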