
	by bluechair 559 days ago
	Has anyone seen how these constraints affect the quality of the LLM's output? In some instances, I'd rather parse Markdown or plain text if it means the quality of the output is higher.

7 comments

lolinder 559 days ago

Working with OpenAI's models I've found a very good strategy is to have two passes if you can afford the extra tokens: one pass uses a heavy model and natural language with markdown sections discussing the reasoning and providing a final natural language answer (ideally labeled clearly with a markdown header). The second pass can use a cheaper and faster model to put the answer into a structured output format for consumption by the non-LLM parts of the pipeline.

You basically use JSON schema mode to draw a clean boundary around the wishy-washy language bits, using the LLM as a preprocessor to capture its own output in a useful format.
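A minimal sketch of the two-pass idea, assuming each model is just a function from prompt to text. The names `two_pass`, `fake_heavy`, and `fake_cheap` are hypothetical stand-ins, not any vendor's API; in a real pipeline the second call would run in JSON schema mode.

```python
import json
from typing import Callable

# Hypothetical model-call signature: takes a prompt, returns the model's text.
ModelFn = Callable[[str], str]

def two_pass(question: str, heavy_model: ModelFn, cheap_model: ModelFn) -> dict:
    """Pass 1 reasons in free-form markdown; pass 2 extracts the answer as JSON."""
    # Pass 1: no format constraints beyond a clearly labeled final section.
    reasoning = heavy_model(
        "Think through the question in markdown, then finish with a "
        "'## Final answer' section.\n\nQuestion: " + question
    )
    # Pass 2: a real pipeline would run this call in JSON schema mode,
    # so the parse below could rely on well-formed output.
    raw = cheap_model(
        "Copy the final answer from the text below into JSON shaped like "
        '{"answer": "<string>"}.\n\n' + reasoning
    )
    return json.loads(raw)

# Stand-in models so the sketch runs without network access.
def fake_heavy(prompt: str) -> str:
    return "## Reasoning\n6 * 7 is 42.\n\n## Final answer\n42"

def fake_cheap(prompt: str) -> str:
    # Pretend extraction: take whatever follows the labeled header.
    answer = prompt.rsplit("## Final answer", 1)[1].strip()
    return json.dumps({"answer": answer})

result = two_pass("What is 6 * 7?", fake_heavy, fake_cheap)
```

Only the second call touches the structured format, so the heavy model's reasoning is never constrained.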

mmoskal 559 days ago

It depends how fine-tuned the model is to JSON output.

Also, you need to tell the model the schema. If you don't, you will get more weird tokenization issues.

For example, if the schema expects a JSON key "foobarbaz" and the canonical BPE tokenization is ["foobar", "baz"], the token mask generated by all current constrained output libraries will let the model choose from "f", "foo", "foobar" (assuming these are all valid tokens). The model might then choose "foo", and then the constraint will force e.g. "bar" and "baz" as the next tokens. Now the model will see ["foo", "bar", "baz"] instead of ["foobar", "baz"] and will get confused [0].

If the model knows from the prompt "foobarbaz" is one of the schema keys, it will generally prefer "foobar" over "foo".

[0] In modern models these tokens are related because of regularization, but they are not the same.
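The "foobarbaz" example can be reproduced with a toy mask. This is only a sketch: the six-token vocabulary and the helper names are assumptions, and real libraries work on token IDs over much larger vocabularies.

```python
# Toy vocabulary (an assumption); real BPE vocabularies are far larger.
VOCAB = ["f", "foo", "foobar", "bar", "baz", "qux"]

def allowed_tokens(remaining: str, vocab: list[str]) -> list[str]:
    """Tokens the mask permits: every token that is a prefix of the text still required."""
    return [tok for tok in vocab if remaining.startswith(tok)]

def constrained_decode(target: str, pick) -> list[str]:
    """Emit tokens until `target` is spelled out; `pick` chooses among the masked options."""
    tokens, remaining = [], target
    while remaining:
        tok = pick(allowed_tokens(remaining, VOCAB))
        tokens.append(tok)
        remaining = remaining[len(tok):]
    return tokens

# The mask at the start of the key allows all three prefixes.
first_mask = allowed_tokens("foobarbaz", VOCAB)

# A model that knows the key from its prompt tends to take the longest token.
canonical = constrained_decode("foobarbaz", pick=lambda opts: max(opts, key=len))

# A model that picks "foo" first is then forced through "bar" and "baz".
def foo_first(opts):
    return "foo" if "foo" in opts else opts[0]

split = constrained_decode("foobarbaz", pick=foo_first)
```

Both runs spell the same string, but only `canonical` matches the tokenization the model saw in training. The toy vocabulary has no way to continue after "f", so neither picker chooses it.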

thot_experiment 559 days ago

YMMV. It has a negative effect on "reasoning", but the delta isn't super significant in most cases. It really depends on the LLM and whether your prompt is likely to generate a JSON response to begin with: the more you have to coerce the LLM, the less likely it is to generate sane output. With smaller models you more quickly end up at the edge of the space where the LLM has meaningful predictive power, so the outputs start getting closer to random noise.

FWIW, this is measured by me using a vibes-based method, nothing rigorous, just a lot of hours spent on various LLM projects. I have not used these particular tools yet, but ollama was previously able to guarantee JSON output through what I assume are similar techniques, and my partner and I previously worked on a jsonformer-like thing for oobabooga, another LLM runtime tool.
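For readers unfamiliar with the jsonformer-style approach: as it is usually described, code writes the fixed structure (braces, keys) and the model is only asked for the values. A rough sketch under that assumption follows; `fill_schema`, the flat `{"field": "type"}` schema format, and `fake_model` are all hypothetical, and real tools constrain value tokens during decoding rather than coercing afterwards.

```python
import json
from typing import Callable

# Hypothetical generator: given context text and a field type, returns a raw value.
ValueFn = Callable[[str, str], object]

def fill_schema(schema: dict, gen_value: ValueFn, prompt: str = "") -> dict:
    """Walk a flat {"field": "type"} schema; code writes the keys, the model writes values."""
    out = {}
    for key, kind in schema.items():
        # Show the model what has been produced so far and which field comes next.
        context = f"{prompt}\nPartial output: {json.dumps(out)}\nNext field: {key} ({kind})"
        value = gen_value(context, kind)
        # Coerce the raw value so the result always matches the declared type.
        out[key] = float(value) if kind == "number" else str(value)
    return out

# Stand-in model so the sketch runs offline.
def fake_model(context: str, kind: str) -> object:
    return "42" if kind == "number" else "Ada"

record = fill_schema({"name": "string", "age": "number"}, fake_model)
serialized = json.dumps(record)
```

Because the keys and punctuation never pass through the model, `serialized` is valid JSON no matter what the model returns for a string field.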

parthsareen 559 days ago

We’ve been keeping a close eye on this as well as research is coming out. We’re looking into improving sampling as a whole on both speed and accuracy.

Hopefully with those changes we might also enable general structured generation, not limited to JSON.

hackernewds 559 days ago

Who is "we"?

parthsareen 559 days ago

I authored the blog with some other contributors and worked on the feature (PR: https://github.com/ollama/ollama/pull/7900).

The current implementation uses llama.cpp GBNF grammars. The more recent research (Outlines, XGrammar) points to potentially speeding up the sampling process through FSTs and GPU parallelism.
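One way to picture the finite-state speedup: walk every token through the automaton once, ahead of time, so the per-step mask becomes a dictionary lookup. The sketch below uses a toy automaton for `[0-9]+` and a made-up vocabulary; it illustrates the general idea only and is not the Outlines or XGrammar implementation.

```python
DIGITS = "0123456789"

# Toy DFA for the pattern [0-9]+ : state 0 = start, state 1 = at least one digit seen.
def step(state: int, ch: str):
    """Return the next DFA state, or None if `ch` is not allowed from `state`."""
    return 1 if ch in DIGITS else None

VOCAB = ["1", "23", "4a", "abc", "007"]  # hypothetical token strings

def build_index(states, vocab):
    """Precompute, per DFA state, which token IDs are legal and which state each leads to."""
    index = {}
    for state in states:
        index[state] = {}
        for tok_id, tok in enumerate(vocab):
            cur = state
            for ch in tok:
                cur = step(cur, ch)
                if cur is None:
                    break  # token contains a character the pattern rejects
            if cur is not None:
                index[state][tok_id] = cur
    return index

# Built once at startup; decoding then needs only INDEX[state] per step.
INDEX = build_index(states=[0, 1], vocab=VOCAB)
allowed_at_start = sorted(INDEX[0])
```

The cost moves from every sampling step to a one-time build, which is the startup cost that the reply below refers to.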

mmoskal 559 days ago

If you want to avoid startup cost, llguidance [0] has no compilation phase and by far the fullest JSON support [1] of any library. I did a PoC llama.cpp integration [2], though our focus is mostly server-side [3].

[0] https://github.com/guidance-ai/llguidance
[1] https://github.com/guidance-ai/llguidance/blob/main/parser/s...
[2] https://github.com/ggerganov/llama.cpp/pull/10224
[3] https://github.com/guidance-ai/llgtrt

HanClinto 559 days ago

I have been thinking about your PR regularly, and pondering how we should go about getting it merged.

I really want to see support for additional grammar engines merged into llama.cpp, and I'm a big fan of the work you did on this.

parthsareen 559 days ago

This looks really useful. Thank you!

netghost 559 days ago

Thank you for the details!

benreesman 559 days ago

I can say that I was categorically wrong about the utility of things like instructor.

It’s easy to burn a lot of tokens, but if the thing you’re doing merits the cost? You can be a bully with it, and while it’s never the best, 95% as good for zero effort is a tool in one’s kit.

crystal_revenge 559 days ago

There was a paper going around claiming that structured outputs did hurt the quality of the output, but it turns out their experiment setup was laughably bad [0].

It looks like, so long as you're reasonable with the prompting, you tend to get better outputs when using structure.

0. https://blog.dottxt.co/say-what-you-mean.html

coredog64 559 days ago

I’ve seen one case where structured output was terrible: OCR transcription of handwritten text in a form with blanks. You want a very low temperature for transcription, but as soon as the model starts to see multiple blank sequences, it starts to hallucinate that “” is the most likely next token.

nikolayasdf123 559 days ago

Same here. I noticed that when you ask the model to generate an elaborate response in natural text and then come up with an answer, the quality is orders of magnitude better, and more in line with what you would expect from human-like reasoning.

Asking the LLM to directly generate JSON gives much worse results, similar to either a random guess or intuition.
