Working with OpenAI's models, I've found that a very good strategy, if you can afford the extra tokens, is to use two passes. The first pass uses a heavy model and natural language, with markdown sections discussing the reasoning and a final natural-language answer (ideally labeled clearly with a markdown header). The second pass can use a cheaper, faster model to put the answer into a structured output format for consumption by the non-LLM parts of the pipeline.
You basically use JSON schema mode to draw a clean boundary around the wishy-washy language bits, using the LLM as a preprocessor to capture its own output in a useful format.
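A minimal sketch of the two-pass idea, with the model calls left as plain callables so it stays provider-agnostic. `heavy_model`, `cheap_model`, `two_pass`, `extract_final_answer` and the `## Final answer` header are all made-up names for illustration, not any vendor's API; in practice the second call would be whatever structured-output or JSON schema mode your provider offers.

```python
import json
from typing import Callable

ANSWER_HEADER = "## Final answer"  # assumed label; any clear header works


def extract_final_answer(markdown: str, header: str = ANSWER_HEADER) -> str:
    """Return the text under `header`, up to the next markdown header."""
    captured = []
    capturing = False
    for line in markdown.splitlines():
        if line.strip().lower() == header.lower():
            capturing = True
            continue
        if capturing and line.lstrip().startswith("#"):
            break
        if capturing:
            captured.append(line)
    return "\n".join(captured).strip()


def two_pass(question: str,
             heavy_model: Callable[[str], str],
             cheap_model: Callable[[str], str],
             schema: dict) -> dict:
    # Pass 1: free-form reasoning in markdown, ending in a labeled answer.
    reasoning = heavy_model(
        f"{question}\n\nReason step by step in markdown sections, "
        f"then give the answer under the header '{ANSWER_HEADER}'."
    )
    # Fall back to the whole reply if the header is missing.
    answer = extract_final_answer(reasoning) or reasoning

    # Pass 2: the cheaper model only reformats; the schema goes in the prompt.
    raw = cheap_model(
        "Convert the answer below into JSON matching this schema.\n"
        f"Schema: {json.dumps(schema)}\n\nAnswer:\n{answer}"
    )
    return json.loads(raw)
```

Only the text under the answer header reaches the second model, so the cheap pass never has to wade through the reasoning. If the second call runs under real schema enforcement, the `json.loads` at the end should not fail; without enforcement you would want a retry around it.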
It depends on how well the model has been fine-tuned for JSON output.
Also, you need to tell the model the schema. If you don't, you will get more weird tokenization issues.
For example, if the schema expects a JSON key "foobarbaz" and the canonical BPE tokenization is ["foobar", "baz"], the token mask generated by all current constrained output libraries will let the model choose from "f", "foo", "foobar" (assuming these are all valid tokens). The model might then choose "foo", and then the constraint will force e.g. "bar" and "baz" as the next tokens. Now the model will see ["foo", "bar", "baz"] instead of ["foobar", "baz"] and will get confused. [0]
If the model knows from the prompt "foobarbaz" is one of the schema keys, it will generally prefer "foobar" over "foo".
[0] In modern models these tokens are related because of regularization, but they are not the same.
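To make the "foobarbaz" example concrete, here is a toy prefix mask. The vocabulary is invented for the example and `allowed_tokens` / `force_key` are hypothetical helpers, not the API of any constrained-decoding library; real libraries mask over the full tokenizer vocabulary and a grammar, not a single string.

```python
# Toy vocabulary, assumed for illustration only.
VOCAB = ["f", "foo", "foobar", "bar", "baz", "b", "az"]


def allowed_tokens(remaining, vocab=VOCAB):
    """Token mask: every token that is a prefix of the text still to emit."""
    return [tok for tok in vocab if remaining.startswith(tok)]


def force_key(key, pick, vocab=VOCAB):
    """Emit `key` token by token; `pick` stands in for the model's choice."""
    tokens = []
    remaining = key
    while remaining:
        options = allowed_tokens(remaining, vocab)
        if not options:
            raise ValueError(f"no token in the vocabulary matches {remaining!r}")
        choice = pick(options)
        tokens.append(choice)
        remaining = remaining[len(choice):]
    return tokens


def pick_longest(options):
    # Roughly what a model does when it already expects the key.
    return max(options, key=len)


def pick_foo(options):
    # A model that does not know the key and happens to like "foo".
    return "foo" if "foo" in options else max(options, key=len)


print(allowed_tokens("foobarbaz"))           # ['f', 'foo', 'foobar']
print(force_key("foobarbaz", pick_longest))  # ['foobar', 'baz']
print(force_key("foobarbaz", pick_foo))      # ['foo', 'bar', 'baz']
```

Both runs spell the same string, so the output is valid either way. The difference is only in the token sequence the model conditions on afterwards, which is the confusion described above.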
YMMV. It's a negative effect in terms of "reasoning", but the delta isn't super significant in most cases. It really depends on the LLM and on whether your prompt is likely to produce a JSON response to begin with: the more you have to coerce the LLM, the less likely it is to generate sane output. With smaller models you more quickly end up at the edge of the space where the LLM has meaningful predictive power, and so the outputs start getting closer to random noise.
FWIW, this is measured by me using a vibes-based method: nothing rigorous, just a lot of hours spent on various LLM projects. I have not used these particular tools yet, but ollama was previously able to guarantee JSON output through what I assume are similar techniques, and my partner and I previously worked on a jsonformer-like thing for oobabooga, another LLM runtime tool.
The current implementation uses llama.cpp GBNF grammars. The more recent research (Outlines, XGrammar) points to potentially speeding up the sampling process through FSTs and GPU parallelism.
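As I understand the speed-up idea, the core trick is to precompute, for each automaton state, which vocabulary tokens are legal and where they lead, so that each sampling step becomes a table lookup. The sketch below is a toy version for the pattern `-?[0-9]+` with an invented vocabulary; `step`, `walk` and `build_index` are hypothetical names and this is not code from Outlines, XGrammar or llama.cpp.

```python
DIGITS = "0123456789"

# Toy DFA for -?[0-9]+ : 0 = start, 1 = after '-', 2 = accepting (digits seen).
STATES = (0, 1, 2)


def step(state, ch):
    """Return the next state, or None if `ch` is not allowed in `state`."""
    if ch in DIGITS:
        return 2                 # a digit is legal from every state
    if ch == "-" and state == 0:
        return 1                 # a sign is only legal at the start
    return None


def walk(state, token):
    """Feed a whole (multi-character) token through the DFA."""
    for ch in token:
        state = step(state, ch)
        if state is None:
            return None
    return state


def build_index(vocab, states=STATES):
    """Precompute state -> {allowed token: resulting state}, done once."""
    index = {}
    for state in states:
        index[state] = {}
        for tok in vocab:
            end = walk(state, tok)
            if end is not None:
                index[state][tok] = end
    return index


VOCAB = ["-", "-1", "12", "7", "a", "1-"]  # assumed toy vocabulary
INDEX = build_index(VOCAB)

print(sorted(INDEX[0]))  # ['-', '-1', '12', '7']
print(sorted(INDEX[1]))  # ['12', '7']
```

At sampling time the mask for the current state is just `INDEX[state]`, and the per-token grammar walk is paid once up front. That up-front work is the compilation cost that the startup-time discussion is about.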
If you want to avoid startup cost, llguidance [0] has no compilation phase and by far the fullest JSON support [1] of any library. I did a PoC llama.cpp integration [2], though our focus is mostly server-side [3].
I can say that I was categorically wrong about the utility of things like instructor.
It’s easy to burn a lot of tokens, but if the thing you’re doing merits the cost, you can be a bully with it. While it’s never the best, 95% as good for zero effort is a tool worth having in one’s kit.
There was a paper going around claiming that structured outputs did hurt the quality of the output, but it turns out their experiment setup was laughably bad [0].
It looks like, so long as you're reasonable with the prompting, you tend to get better outputs when using structure.
I’ve seen one case where structured output was terrible: OCR transcription of handwritten text in a form with blanks. You want a very low temperature for transcription, but as soon as the model starts to see multiple blank sequences, it starts to hallucinate that “” is the most likely next token.
Same here. I noticed that when you ask the model to write out an elaborate response in natural language and only then come up with an answer, the quality is much better, and closer to what you would expect from human-like reasoning.
Asking the LLM to generate JSON directly gives much worse results, similar to either a random guess or intuition.