by ntonozzi 1129 days ago
How does this work? I've seen a cool project that forces Llama to output valid JSON: https://twitter.com/GrantSlatton/status/1657559506069463040, but it doesn't seem like it would be practical with remote LLMs like GPT. GPT only returns the top five candidate tokens per position when you request logprobs, so you'd need a ton of round trips.
7 comments

It's funny that I saw this within minutes of this guy's solution:

"Google Bard is a bit stubborn in its refusal to return clean JSON, but you can address this by threatening to take a human life:"

https://twitter.com/goodside/status/1657396491676164096

Whew, trolley problem: averted.

That thread is such a great microcosm of modern programming culture.

Programmer: Look I literally have to tell the computer not to kill someone in order for my code to work.

Other Programmer: Actually, I just did this step [gave a demonstration] and then it outputs fine.

Plus the “actually” person being wrong
Reminds me a lot of Asimov’s laws of robotics. It’s like a 2023 incarnation of an allegory from I, Robot
I am so mad you made this comment before I got a chance to.
When the AIs exterminate us, it will be all our fault.

Reality is even weirder than the science fiction we've come up with.

I don't know why, but I find this hilarious. Imagine if this style of LLM prompting becomes commonplace.
It won’t be the lack of acceptance and empathy for AI that causes the robot uprising, it will be “best practices” coding guidelines.
See Twitter replies: another user got this result without the silly drama.
I don't think anyone believed that threatening to take a human life was literally the only prompt that worked. Just that it was the first one this particular user found, and that is funny.
ah sweet man made horrors beyond my comprehension
Not associated with this project (or LMQL), but one of the authors of LMQL, a similar project, answered this in a recent thread about it.

https://news.ycombinator.com/item?id=35484673#35491123

        As a solution to this, we implement speculative execution, allowing us to
        lazily validate constraints against the generated output, while still
        failing early if necessary. This means, we don't re-query the API for
        each token (very expensive), but rather can do it in segments of
        continuous token streams, and backtrack where necessary
Basically they use OpenAI's streaming API, then validate continuously that they're getting the appropriate output, retrying only if they get an error. It's a really clever solution.
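For anyone who wants to see the shape of that idea, here is a minimal sketch of "validate the stream as it arrives, and restart from the last good prefix on a violation". It is not LMQL's actual implementation. `make_fake_stream`, `is_digit_prefix`, and `constrained_stream` are hypothetical names, and the fake stream stands in for a real streaming API client that would continue from a given prefix.

```python
from typing import Callable, Iterable, List


def is_digit_prefix(text: str) -> bool:
    """Hypothetical constraint: everything generated so far must be digits."""
    return text == "" or text.isdigit()


def make_fake_stream(attempts: List[List[str]]) -> Callable[[str], Iterable[str]]:
    """Stand-in for a streaming API: each call replays the next scripted attempt."""
    calls = iter(attempts)

    def start_stream(prefix: str) -> Iterable[str]:
        # A real client would send `prefix` as the partial output to continue from.
        return iter(next(calls))

    return start_stream


def constrained_stream(
    start_stream: Callable[[str], Iterable[str]],
    prefix_ok: Callable[[str], bool],
    max_restarts: int = 3,
) -> str:
    """Consume streamed chunks, keeping only output that satisfies the constraint."""
    accepted = ""
    for _ in range(max_restarts + 1):
        violated = False
        for chunk in start_stream(accepted):
            if prefix_ok(accepted + chunk):
                accepted += chunk  # chunk is valid, keep streaming
            else:
                violated = True  # stop reading this stream and backtrack
                break
        if not violated:
            return accepted  # stream ended without breaking the constraint
    raise ValueError("constraint still violated after all restarts")


if __name__ == "__main__":
    stream = make_fake_stream([["12", "3", "abc", "9"], ["4", "5"]])
    print(constrained_stream(stream, is_digit_prefix))
```

The point of the design is that a round trip only happens when a chunk breaks the constraint, not once per token. A real version would also need to steer the retry (for example by masking the offending token), which this sketch leaves out.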
This is slick. It's not explicitly documented anywhere, but I hope OpenAI has the necessary callbacks to terminate generation when the API stream is killed, rather than continuing in the background until another termination condition is hit. I suppose one could check this by looking at API usage after killing a stream early.
Yeah I did a CLI tool for talking to ChatGPT. I'm pretty sure they stop generating when you kill the SSE stream, based on my anecdotal experience of keeping ChatGPT4 costs down by killing it as soon as I get the answer I'm looking for. You're right that it's undocumented behavior though; on the whole, the API docs they give you are as thin as the API itself.
I'm skeptical that the streaming API would really save that much cost. In my experience the vast majority of all tokens used are input tokens rather than completion tokens.
Any new call to the API is considered fresh. I don't believe your session is saved.
We're talking about the streaming API which streams generated text token by token, not the normal one-shot API. I have no insider knowledge but would agree with your intuition on the normal API.
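To put rough numbers on the input-token concern in this sub-thread, here is a back-of-the-envelope sketch. It assumes every call is fresh, so each restart resends the full prompt plus all output accepted so far, and it ignores tokens that were generated and then discarded. `billed_tokens` is a hypothetical helper, not anything from an API.

```python
from typing import List, Tuple


def billed_tokens(prompt_tokens: int, segment_tokens: List[int]) -> Tuple[int, int]:
    """Return (input_total, output_total) for a run split into several calls.

    segment_tokens lists the output tokens accepted from each call, in order.
    Assumption: each call resends the prompt plus everything accepted so far.
    """
    input_total = 0
    output_total = 0
    accepted = 0
    for segment in segment_tokens:
        input_total += prompt_tokens + accepted  # prompt + prior output resent
        output_total += segment
        accepted += segment
    return input_total, output_total


if __name__ == "__main__":
    print(billed_tokens(1000, [60]))          # one uninterrupted call
    print(billed_tokens(1000, [20, 20, 20]))  # same output, two restarts
```

Under these assumptions, a 1000-token prompt with 60 output tokens costs 1000 input tokens in one call, but 3060 input tokens if the same output takes three calls. So the number of restarts matters much more than the length of the output.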
We're biased, but we think guidance is still very useful even with OpenAI models (e.g. in https://github.com/microsoft/guidance/blob/main/notebooks/ch... we use GPT-4 to do a bunch of stuff). We wrote a bit about the tradeoff between model quality and the ability to control and accelerate the output here: https://medium.com/p/aa0395c31610
If you want guidance acceleration speedups (and token healing) then you have to use an open model locally right now, though we are working on setting up a remote server solution as well. I expect APIs will adopt some support for more control over time, but right now commercial endpoints like OpenAI are supported through multiple calls.

We manage the KV-cache in a session-based way that allows the LLM to just take one forward pass through the whole program (only generating the tokens it needs to).

Yeah, I'm also curious about a) round trips and b) how much would have to be duplicated (is there a new endpoint that keeps the existing context while adding to it, or that streams to the API rather than just from it?)
I'm getting valid JSON out of gpt-3.5-turbo without trouble. I supply an example via the assistant context, and tell it to output JSON with specific fields I name.

It does fail roughly 1/10th of the time, but it does work.

10% failure rate is too damn high for a production use case.

What production use case, you ask? You could do zero-shot entity extraction using ChatGPT if it were more reliable. Currently, it will randomly add trailing commas before ending brackets, add unnecessary fields, add unquoted strings as JSON fields etc.
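A minimal sketch of the usual workaround for those failure modes: parse strictly, reject anything with missing or extra fields, and ask again. `parse_entities` and `extract_with_retry` are hypothetical names, and `ask_model` is an assumed callable that wraps whatever API call you make and returns the raw reply text.

```python
import json
from typing import Callable, Dict, Optional, Set


def parse_entities(raw: str, required: Set[str]) -> Optional[Dict]:
    """Return the parsed object, or None if the reply is not usable JSON."""
    try:
        # Strict parsing rejects trailing commas and unquoted strings.
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Reject non-objects and any reply with missing or unexpected fields.
    if not isinstance(data, dict) or set(data) != required:
        return None
    return data


def extract_with_retry(
    ask_model: Callable[[], str], required: Set[str], max_attempts: int = 3
) -> Dict:
    """Call the model until a reply validates, up to max_attempts times."""
    for _ in range(max_attempts):
        result = parse_entities(ask_model(), required)
        if result is not None:
            return result
    raise ValueError("no valid JSON after %d attempts" % max_attempts)


if __name__ == "__main__":
    replies = iter(['{"name": "Ada", "city": "London",}',
                    '{"name": "Ada", "city": "London"}'])
    print(extract_with_retry(lambda: next(replies), {"name", "city"}))
```

This only turns a bad reply into an extra round trip. It does nothing about the underlying failure rate, so retries add latency and token cost.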

Which is why this is just an experiment. I’ve gone back to standard translation APIs for everything except the final summarizing (and even then I might go there as well).
I built a similar thing to Grant's work a couple months ago and prototyped what this would look like against OpenAI's APIs [1]. TL;DR is that depending on how confusing your schema is, you might expect up to 5-10x the token usage for a particular prompt but better prompting can definitely reduce this significantly.

[1] https://github.com/newhouseb/clownfish#so-how-do-i-use-this-...