Got it, thanks. Certainly a very interesting and active space. I was playing around with FLARE (https://arxiv.org/abs/2305.06983) for RAG this week, and LMQL (mentioned by another poster) seems to use a similar technique.
In response to your sister comment: the implementation we used was the naive one from LangChain (https://python.langchain.com/docs/modules/chains/additional/...). We've decomposed that to use as a starting point but early results are promising, yes, although it doesn't yet seem to be possible to get the necessary `logprobs` out of the GPT-4 API, so we're stuck with 3.5-turbo atm.
Using tiktokenizer, these are only two tokens: quote-colon is token 498, space-quote is token 330 (as per https://tiktokenizer.vercel.app/ ). But I agree to the general argument.
I think what factors in even more when you use the API is that you do not have fine-grained control over the generation process. If you follow the MS guidance approach, you fill in structured text yourself, and then let the model generate only the value parts, e.g. up to the next quote. To do that more or less word by word, you have multiple API calls, and have to be very smart about providing the right stop tokens.
Does this help clarify?