by mmoskal 559 days ago
It depends on how well the model has been fine-tuned for JSON output.

Also, you need to tell the model the schema. If you don't, you will get more weird tokenization issues.

For example, if the schema expects a JSON key "foobarbaz" and the canonical BPE tokenization is ["foobar", "baz"], the token mask generated by all current constrained output libraries will let the model choose from "f", "foo", "foobar" (assuming these are all valid tokens). The model might then choose "foo", and the constraint will then force e.g. "bar" and "baz" as the next tokens. Now the model sees ["foo", "bar", "baz"] instead of ["foobar", "baz"] and gets confused. [0]
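A rough sketch of that failure mode, in case it helps. Everything here is made up for illustration: the toy vocabulary, the function names `allowed_tokens` and `constrained_decode`, and the two "model preference" functions. Real libraries compute masks over the full tokenizer vocabulary and a grammar, not a single target string.

```python
# Toy vocabulary (hypothetical); real BPE vocabularies have tens of thousands of entries.
VOCAB = ["f", "foo", "foobar", "bar", "baz", "qux"]


def allowed_tokens(target: str, emitted: str, vocab=VOCAB) -> list[str]:
    """Token mask: every token that keeps emitted + token a prefix of target."""
    remaining = target[len(emitted):]
    return [tok for tok in vocab if remaining.startswith(tok)]


def constrained_decode(target: str, pick) -> list[str]:
    """Emit tokens until target is spelled out; pick stands in for the model's choice."""
    emitted, tokens = "", []
    while emitted != target:
        tok = pick(allowed_tokens(target, emitted))
        tokens.append(tok)
        emitted += tok
    return tokens


def prefers_foo(options):
    # A model that has not seen the schema and happens to like "foo".
    return "foo" if "foo" in options else max(options, key=len)


def prefers_longest(options):
    # A model that knows the key and goes for the canonical (longest) token.
    return max(options, key=len)


first_mask = allowed_tokens("foobarbaz", "")
non_canonical = constrained_decode("foobarbaz", prefers_foo)
canonical = constrained_decode("foobarbaz", prefers_longest)

print(first_mask)     # ['f', 'foo', 'foobar']
print(non_canonical)  # ['foo', 'bar', 'baz']
print(canonical)      # ['foobar', 'baz']
```

Both decodes spell the same string, so the constraint is satisfied either way; only the token sequence the model conditions on differs.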

If the model knows from the prompt that "foobarbaz" is one of the schema keys, it will generally prefer "foobar" over "foo".

[0] In modern models these tokens are related because of regularization, but they are not the same.