Hacker News
by 2bitencryption 1042 days ago
It still blows my mind that OpenAI exposes an API with function calling and yet does not guarantee the model will call your function correctly. In fact, it does not even guarantee the output will be valid JSON.

This is, really, a solved problem. I've been using github.com/microsoft/guidance for weeks, and it genuinely guarantees output that conforms to the grammar, because it simply does not sample tokens that would be invalid.

It just seems so obvious, I still have no clue why OpenAI does not do this. Like, why fuss around with validating JSON after the fact, when you can simply guarantee it is correct in the first place, by only sampling tokens if they conform to the grammar you are trying to emit?
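
To make the idea concrete, here is a minimal sketch of grammar-constrained decoding. It is not how guidance or OpenAI implement anything: the vocabulary, the state-machine grammar, and the fixed scores in `fake_logits` are all made up for illustration. The stand-in "model" always scores the token `hello` highest, yet the output is still valid JSON because only grammar-permitted tokens are ever considered.

```python
import json

# Toy vocabulary (hypothetical); a real tokenizer has tens of thousands of entries.
VOCAB = ['{', '}', '"name"', ':', '"Ada"', '"Bob"', 'hello']

# Hypothetical grammar for objects shaped like {"name": <"Ada" or "Bob">},
# written as state -> {allowed token: next state}.
GRAMMAR = {
    0: {'{': 1},
    1: {'"name"': 2},
    2: {':': 3},
    3: {'"Ada"': 4, '"Bob"': 4},
    4: {'}': 5},
}
END_STATE = 5

# Made-up scores standing in for a model forward pass.
SCORES = {'hello': 5.0, '"Ada"': 2.0, '"Bob"': 1.0}


def fake_logits():
    """Return one score per vocabulary entry; 'hello' always scores highest."""
    return [SCORES.get(tok, 0.0) for tok in VOCAB]


def constrained_decode():
    """Greedy decoding restricted to tokens the grammar allows in each state."""
    state, out = 0, []
    while state != END_STATE:
        logits = fake_logits()
        allowed = GRAMMAR[state]
        # Pick the best-scoring token among those the grammar permits.
        best = max((i for i, tok in enumerate(VOCAB) if tok in allowed),
                   key=lambda i: logits[i])
        tok = VOCAB[best]
        out.append(tok)
        state = allowed[tok]
    return ''.join(out)


text = constrained_decode()
print(text)              # {"name":"Ada"}
print(json.loads(text))  # {'name': 'Ada'}
```

Unconstrained greedy decoding with the same scores would emit `hello` forever; the grammar only changes which candidates are eligible, not how the model scores them.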

3 comments

IANA{LLM}, but if you're only sampling from a "correct" grammar, you are potentially (very potentially) forgoing what might otherwise have been a more desirable and more semantically useful token. Most of the models have been trained on a vast amount of human language, not necessarily structured data, and so I'd rather opt for a more semantically enriched format (e.g. XML or YAML), because those are designed to be more human readable. Or perhaps preferably: have the boss LLM pump out what it excels at (strings of prose most of the time) and have a secondary model with a stricter grammar convert that to JSON.
I think this is likely a consequence of a couple of factors:

1. Fancy token selection within batches (read: beam search) is probably fairly hard to implement at scale without a significant loss in GPU utilization. Normally you can batch up a bunch of parallel generations and just push them all through the LLM at once, because every generated token (for similar prompt sizes, plus some padding perhaps) takes a predictable time. If you stick a parser in between every token, and that parser can take variable time, then your batch is slowed by the most complex grammar of the bunch.

2. OpenAI appears to work under the thesis articulated in the Bitter Lesson [i] that more compute (either via fine-tuning or bigger models) is the least foolish way to achieve improved capabilities, hence their approach of function calling just being... a fine-tuned model.

[i] http://www.incompleteideas.net/IncIdeas/BitterLesson.html
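
A back-of-the-envelope sketch of point 1, with made-up numbers. It assumes the forward pass is shared by the whole batch and that a step cannot finish until every sequence's grammar mask is ready, with no overlap between the two; `batch_step_ms` is a hypothetical helper, not anything from a real serving stack.

```python
def batch_step_ms(forward_ms, mask_ms_per_sequence):
    """Latency of one decoding step for a whole batch.

    The slowest grammar mask in the batch sets the pace for every sequence.
    """
    return forward_ms + max(mask_ms_per_sequence, default=0.0)


# No grammars in the batch: the step costs just the forward pass.
plain = batch_step_ms(20.0, [])
# Three cheap grammars and one expensive one: everyone waits for the slow one.
mixed = batch_step_ms(20.0, [0.5, 0.5, 0.5, 12.0])
print(plain, mixed)  # 20.0 32.0
```

Under those assumptions, one complex grammar adds its cost to every request sharing the batch, which is the utilization concern described above.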

The "Bitter Lesson" indeed sheds light on the future trajectory of technology, emphasizing the supremacy of computation over human-designed methods. However, our current value functions often still need to focus on what we can achieve with the tools and methods available to us today. While it's likely that computational tools will eventually replace the human-guided "outlines" or "guidance" used to shape LLM outputs, there will likely always be a substantial number of human-structured knobs necessary to align computation with our immediate needs and goals.
What a fascinating read, thanks for sharing that link.
I just left a comment along these lines, but realistically it's probably cheaper to just re-emit than to add the machinery that enables this to their existing architecture.

At most I could have seen them maybe running a schema validator against the output and re-requesting on your behalf, but even that's probably cheaper for them to do client-side. (I will say, I'm surprised their API wrapper hasn't been updated to do this yet.)
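
For reference, the client-side validate-and-retry loop looks roughly like the sketch below. `generate` is a hypothetical callable standing in for a model call, and `matches_schema` is a deliberately minimal check (required keys and types), not a real JSON Schema validator.

```python
import json


def matches_schema(value, required):
    """Minimal check: value is an object with the required keys and types."""
    return isinstance(value, dict) and all(
        key in value and isinstance(value[key], typ)
        for key, typ in required.items()
    )


def call_with_retry(generate, required, max_attempts=3):
    """Call generate() until its output parses and passes the check."""
    for attempt in range(1, max_attempts + 1):
        raw = generate()
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # invalid JSON: pay for another full generation
        if matches_schema(parsed, required):
            return parsed, attempt
    raise ValueError(f"no valid output after {max_attempts} attempts")


# Hypothetical model stub: truncated JSON, then a wrong type, then a good reply.
replies = iter(['{"city": "Paris"', '{"city": 42}', '{"city": "Paris"}'])
result, attempts = call_with_retry(lambda: next(replies), {"city": str})
print(result, attempts)  # {'city': 'Paris'} 3
```

Every failed attempt costs a full generation, which is the trade-off being weighed against changing the sampling machinery.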

> maybe running a schema validator against the output and re-requesting on your behalf

This is the part that blows my mind. You don't have to do this! You don't have to sample the entire output, and then validate after the fact.

You're not required to greedily pick the token with the highest score. You get the scores of all tokens, on every forward pass! So why even waste time picking invalid tokens if you're just going to validate and retry later on??

(note: when I say "you" here, I mean whoever is hosting the model. It is true that OpenAI does not expose all token scores; it only gives you back the highest-scoring one. So a client-side library is not able to perform this grammar-based sampling.

BUT, OpenAI themselves host the model, and they see all token outputs, with all scores. And in the same API request, they allow you to pass the "function definition" as a JSON schema. So why not simply apply that function definition as a mask on the token outputs? They could do this without exposing all token scores to you, which they seem very opposed to for some reason.)
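
The masking step itself is small. A sketch, assuming the host already has one score per token and some way (not shown) to turn the schema into a set of allowed token indices at each step; the scores and the `allowed` set here are made up.

```python
import math


def masked_softmax(logits, allowed):
    """Softmax over allowed indices only; every other token gets probability 0."""
    if not allowed:
        raise ValueError("the mask must permit at least one token")
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    top = max(masked)
    exps = [math.exp(x - top) for x in masked]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 1.0, 2.0, 0.5]  # made-up scores; index 0 is the raw argmax
allowed = {1, 2}               # indices a schema-derived mask would permit
probs = masked_softmax(logits, allowed)
print([round(p, 3) for p in probs])  # [0.0, 0.269, 0.731, 0.0]
```

The highest-scoring token (index 0) ends up with probability zero, and sampling proceeds over the remaining, renormalized candidates. None of the raw scores need to leave the server.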

Maybe re-read what I said?

> realistically it's probably cheaper to just re-emit than to add the machinery that enables this to their existing architecture

There are literally dozens of random projects that have implemented logit-based masking; it's a trivial thing to implement.

What's probably not as trivial is deploying it at scale with whatever architecture OpenAI already has in place. Especially if they're using the router-based MoE architecture most people are assuming they use.

OpenAI doesn't expose token probabilities for their RLHF models, yet they did for GPT-3. Originally that led to speculation that it was to make building competitors harder, but they've now said they're actually still working on it... which leans even further into the idea that they may have an architecture that makes the kind of sampling these projects rely on more difficult to implement than normal.

Given how fast and cheap they've made access to these models, their current approach is a practical workaround if that's the case.

When GPT-4 first became available, I had a feeling that something about it felt “hacky”. Compared to GPT-3, which was more streamlined, mature, and well thought out, GPT-4 was like a system put together to outperform the previous one at all costs. I wouldn’t be surprised if that led to design decisions that made their model hard to improve. Maybe GPT-5 will not be around any time soon.