by andy_xor_andrew 1066 days ago

Here's one thing I don't get.

Why all the rigamarole of hoping you get a valid response, adding last-mile validators to detect invalid responses, trying to beg the model to pretty please give me the syntax I'm asking for...

...when you can guarantee a valid JSON syntax by only sampling tokens that are valid? Instead of greedily picking the highest-scoring token every time, you select the highest-scoring token that conforms to the requested format.

This is what Guidance does already, also from Microsoft: https://github.com/microsoft/guidance

But OpenAI apparently does not expose the full scores of all tokens; it only exposes the highest-scoring token. Which is so odd, because if you run models locally, using Guidance is trivial, and you can guarantee your JSON is correct every time. It's faster to generate, too!
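
To make the idea concrete, here is a toy sketch of constrained greedy decoding. It is not how Guidance is implemented. The vocabulary, the `GRAMMAR` table, and `fake_scores` are all made up for illustration, and a real setup would mask the logits of an actual model against a much richer grammar.

```python
import json

# Toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ["Sure", "!", "{", "}", '"name"', ":", '"Ada"', '"Bob"', "<eos>"]

# Hypothetical token-level grammar for the single shape {"name": <string>}.
# Each state maps the tokens allowed there to the state that follows.
GRAMMAR = {
    0: {"{": 1},
    1: {'"name"': 2},
    2: {":": 3},
    3: {'"Ada"': 4, '"Bob"': 4},
    4: {"}": 5},
    5: {"<eos>": None},
}


def fake_scores(step):
    """Stand-in for model logits: chatty tokens always score highest."""
    scores = {tok: 0.0 for tok in VOCAB}
    scores["Sure"] = 5.0
    scores["!"] = 4.0
    preferred = ["{", '"name"', ":", '"Ada"', "}", "<eos>"]
    scores[preferred[min(step, len(preferred) - 1)]] = 3.0
    return scores


def constrained_greedy_decode(score_fn, max_steps=20):
    """Pick the highest-scoring token among those the grammar allows."""
    state, out = 0, []
    for step in range(max_steps):
        scores = score_fn(step)
        allowed = GRAMMAR[state]
        token = max(allowed, key=lambda t: scores[t])
        if token == "<eos>":
            break
        out.append(token)
        state = allowed[token]
    return "".join(out)


# Unconstrained greedy decoding would start with the chatty token.
unconstrained_first = max(VOCAB, key=fake_scores(0).get)

result = constrained_greedy_decode(fake_scores)
print(unconstrained_first, result, json.loads(result))
```

Even though "Sure" outscores every other token at each step, the constrained decoder never emits it, so the output always parses.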

6 comments

zarzavat 1066 days ago

It’s like the story of the brown M&Ms[0]. If the model is returning semantically correct data, you would hope that it can at least get the syntax correct. And if it can’t then you ought to throw the response away anyway.

Also I believe that such a method cannot capture the full complexity of TypeScript types.

[0] https://www.snopes.com/fact-check/brown-out/

tonyonodi 1066 days ago

That's a great analogy! I'd been wondering for a while whether that's a problem with this approach; to be honest I still don't know whether it is, so it would be good to see someone test it empirically.

rolisz 1066 days ago

> when you can guarantee a valid JSON syntax by only sampling tokens that are valid? Instead of greedily picking the highest-scoring token every time, you select the highest-scoring token that conforms to the requested format.

Yes, you can guarantee syntactically correct JSON that way, but will it be semantically correct? If the model really really really wanted to put another token there, but you are forcing it to put a {, maybe the following generated text won't be as good.

I'm not sure, I'm just wondering out loud.

geysersam 1066 days ago

Well, if the output doesn't conform to the format it's useless. If the model can't produce good and correct output then it's simply not up to the task.

waffletower 1065 days ago

In my experience, a fair share of LLM responses are semantically useful but do not precisely adhere to the requested format. If I chose to use a strongly typed language for LLM parsing, perhaps I would be tempted to eliminate complexity and simply throw structural outliers away, and explain to the suits that a certain percentage of our queries/expenses are unusable. Instead, more sophisticated coercion techniques could be applied to increase output utilization.
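
As one possible example of such coercion (a rough sketch, not a technique named in this thread): scan the response for the first substring that parses as a JSON object and ignore any chatter around it. The helper name `coerce_json` is hypothetical; `json.JSONDecoder.raw_decode` is from the Python standard library.

```python
import json


def coerce_json(response: str):
    """Return the first JSON object found in `response`, or None."""
    decoder = json.JSONDecoder()
    start = response.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and tolerates trailing text after it.
            obj, _end = decoder.raw_decode(response, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = response.find("{", start + 1)
    return None


reply = 'Sure! Here is the JSON: {"item": "chocolate bar", "qty": 2} Hope that helps.'
print(coerce_json(reply))
```

This only rescues responses where valid JSON is buried in extra text. It does nothing for output that is structurally broken, such as a missing closing brace.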

IanCal 1066 days ago

That really strongly depends on your task. Lots of tasks can accept a non-zero failure rate in return for better results on the successful cases. I'm not sure I can think of any off the top of my head where you'd use an LLM and can never deal with a failure, particularly if you're using an external service where you're guaranteed to have to deal with errors or downtime at some point.

donfotto 1066 days ago

I agree that sampling only valid tokens is a very promising approach.

I experimented a bit with finetuning open source LLMs for JSON parsing (without guided token sampling). Depending on one's use case, 70B parameters might be overkill. I've seen promising results with much, much smaller models. Finetuning a small model combined with guided token sampling would be interesting.

Then again, finetuning is perhaps not perfect for very general applications. When you get input that you didn't anticipate in your training dataset, you're in trouble.

csomar 1066 days ago

The LLM will be able to handle more complex scenarios. I could imagine a use case: if you are ordering from a vending machine, instead of having to go through the whole process you just say your order out loud. You can say, for example, "a couple of chocolate bars" and the LLM tries to guess from inventory.

Of course, if you are on the web, it makes no sense. It is much easier to use the mouse to click on a couple of items.

Scaevolus 1064 days ago

Llama.cpp recently added grammar-based sampling, which constrains token selection to follow a rigid format like you describe.

https://github.com/ggerganov/llama.cpp/pull/1773

CGamesPlay 1066 days ago

OpenAI doesn’t expose this information because it makes it vastly easier to train your model off theirs.
