| HN Mirror

What's interesting to me is that while it was obvious to all of us who came to think in the Unix Way, that insofar as composability, usage discoverability, and gobs of documentation in posts and man pages that are hugely represented in training corpora for LLMs, that the CLI is a great fit for LLM tool use, it seems only a recent trend to acknowledge this (and also the next hype wave, perhaps.)

Also interesting that while the big vendors are following this trend and are now trying to take a lead in it, they still suggest things like "but use a JSON schema" (the linked article does a bit of the same - acknowledging that incremental learning via `--help` is useful AND can be token-conserving (exception being that if they already "know" the correct pattern, they wouldn't need to use tokens to learn it, so there is a potential trade-off), they are also suggesting that LLMs would prefer to receive argument knowledge in json rather than in plain language, even though the entire point of an LLM is for understand and create plain language. Seemed dubious to me, and a part of me wondered if that advice may be nonsense motivated by desire to sell more token use. I'm only partially kidding and I'm still dubious of the efficacy.

* Here's a TL;DR for anyone who wants to skip the rest of this long message: I ran an LLM CLI eval in the form of a constructed CTF. Results and methodology are in the two links in the section linked: https://github.com/scottvr/jelp?tab=readme-ov-file#what-else

Anyhow... I had been experimenting with the idea of having --help output json when used by a machine, and came up with a simple module that exposes `--help` content as json, simply by adding a `--jelp` argument to any tool that already uses argparse.

In the process, I started testing, to see if all this extra machine-readable content actually improved performance, what it did to token use, etc. While I was building out test, trying to settle on legitimate and fair ways to come to valid conclusions, I learned of the OpenCLI schema draft, so I altered my `jelp` output to fit that schema, and set about documenting the things I found lacking from the schema draft, meanwhile settling to include these arg-related items as metadata in the output.

I'll get to the point. I just finished cleaning the output up enough to put it in a public repo, because my intent is to share my findings with the OpemCLI folks, in hopes that they'll consider the gaps in their schema compared to what's commonly in use, but at the same time, what came as a secondary thought in service of this little tool I called "jelp", is a benchmarking harness (and the first publishable results from it), the to me, are quite interesting and I would be happy if others found it to be and added to the existing test results with additional runs, models, or ideas for the harness, or criticism about the validity of the method, etc.

The evaluation harness uses constructed CLI fixtures arranged as little CLI CTF's, where the LLMs demonstrate their ability to use an unknown CLI be capturing a "flag" that they'll need to discover by using the usage help, and a trail of learned arguments.

My findings at first confirmed my intuitions, which was disappointing but unsurprising. When testing with GPT-4.1-mini, no manner of forcing them to receive info about the CLI via json was more effective than just letting them use the human-friendly plain English output of --help, and in all cases the JSON versions burned more tokens. I was able to elicit better performance by some measurements from 5.1-mini, but again the tradeoff was higher token burn.

I'll link straight to the part of the README that shows one table of results, and contains links to the LLM CLI CTF part of the repo, as well as the generated report after the phase-1 runs; all the code to reproduce or run your own variation is there (as well as the code for the jelp module, if there is any interest, but it's the CLI CTF eval that I expect is more interesting to most.)

https://github.com/scottvr/jelp?tab=readme-ov-file#what-else