Hacker News
by robertclaus 819 days ago
"Setting up LLMs to output structured data is incredibly hard." resonated strongly with my experience working on similar one-off projects. I've almost always implemented some level of fuzzy matching to validate and convert the LLM output back into my expected structured format.
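
For illustration, here is a minimal sketch of that kind of fuzzy matching, using Python's standard `difflib`. The field, its allowed values, and the `normalize_category` helper are hypothetical; a real project would use its own vocabulary and probably tune the cutoff.

```python
import difflib

# Hypothetical set of allowed values for one field of the expected structure.
ALLOWED_CATEGORIES = ["meeting", "reminder", "deadline"]


def normalize_category(raw, allowed=ALLOWED_CATEGORIES, cutoff=0.6):
    """Map a possibly misspelled LLM value onto the closest allowed value.

    Returns None when nothing is similar enough, so the caller can
    reject the output or ask for a re-generation.
    """
    cleaned = raw.strip().lower()
    matches = difflib.get_close_matches(cleaned, allowed, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Returning `None` instead of guessing keeps the validation step honest: a value that matches nothing is treated as a failed generation rather than silently coerced.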

I've also noticed that the LLMs are much better at writing code than structured JSON (no real surprise given the popularity of code assistants). If it makes sense in the specific situation, I now have the LLM generate code and parse it into the right structure rather than requesting structured data directly:

`generate_event("I need to do X", new Date("1-1-2025"))` seems to be more reliable to generate than `{ "description": "I need to do X", "when": "1-1-2025" }`

7 comments

I really don't get what people are doing wrong here.

I have a 7000-token prompt that generates JSON chugging away in production, and at scale I'm seeing ~1 in 4000 generations require a re-generation. Even that could probably be killed with some basic "healing" code.

OSS models are prone to outputting garbage in my experience, but OP mentions ChatGPT:

How are you running into issues if you simply prefill the response with ```json and set ``` as your stop token?

Also, are people just not trying to parse from the opening bracket to the closing bracket, and instead treating the output as broken if there's a preamble? The prefill gets rid of the preamble, but if you're not willing/able to prefill, how hard is getting JSON out of a string?
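
A rough sketch of what pulling JSON out of a string plus a re-generation loop could look like, using only the standard `json` module. The `generate` argument is a hypothetical callable standing in for whatever model call you use; it is not a real library API.

```python
import json


def extract_json(text):
    """Return the first JSON object that parses in `text`, or None.

    Tries each "{" in turn, so a preamble, a closing code fence, or
    trailing chatter around the object is ignored.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    return None


def generate_json(generate, prompt, max_attempts=3):
    """Call the (hypothetical) `generate(prompt)` until its output contains JSON."""
    for _ in range(max_attempts):
        obj = extract_json(generate(prompt))
        if obj is not None:
            return obj
    raise ValueError(f"no parseable JSON after {max_attempts} attempts")
```

This only checks that the output parses; validating the keys and value types against your expected structure would be a separate step.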

If you're doing it locally, it's likely got llama.cpp underneath it somewhere. Ask the dev to allow specifying a JSON schema using its grammar feature.
As long as you can sanitize the LLM output somehow. You should never `eval` LLM code straight from the tap!
You shouldn't sanitize. If you're taking the approach described above, you should run it inside a minimal interpreter that doesn't implement any potentially dangerous APIs.
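
As a sketch of what such a minimal interpreter might look like, the following walks Python's `ast` and accepts only literals and calls to whitelisted functions. It assumes the model is asked to emit Python-style calls with ISO dates, rather than the JS-style `new Date(...)` in the post above; `generate_event`, `date`, and `run_llm_code` are hypothetical names for illustration.

```python
import ast
from datetime import date


def generate_event(description, when):
    """Hypothetical target function: builds the structured record."""
    return {"description": description, "when": when.isoformat()}


# Whitelist: the only names generated code is allowed to call.
ALLOWED_CALLS = {"generate_event": generate_event, "date": date.fromisoformat}


def evaluate(node):
    """Evaluate a tiny expression subset: literals and whitelisted positional calls."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (str, int, float, bool)):
        return node.value
    if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
            and node.func.id in ALLOWED_CALLS and not node.keywords):
        args = [evaluate(arg) for arg in node.args]
        return ALLOWED_CALLS[node.func.id](*args)
    # Attribute access, imports, operators, etc. are simply not implemented.
    raise ValueError(f"disallowed syntax: {ast.dump(node)}")


def run_llm_code(source):
    """Parse one expression of LLM output and evaluate it without `eval`."""
    tree = ast.parse(source.strip(), mode="eval")
    return evaluate(tree.body)
```

With this, `run_llm_code('generate_event("I need to do X", date("2025-01-01"))')` yields the structured record, while something like `__import__("os").system("ls")` is rejected because attribute access and unknown names are never evaluated.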
I think it's the training data: there is not a lot of JSON. It's much easier to get it to generate list-style data, like "foo:\n* prop1 - val1\n* prop2 - val2", or similar formats, as the models seem to have seen a lot of that sort of data.
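
Parsing that list style back into a structure is short. A minimal sketch, assuming exactly the "name:" header and "* key - value" bullet shape from the example above (the `parse_list_output` helper is hypothetical):

```python
import re

# One bullet line: "* key - value" (a leading "-" bullet is accepted too).
ITEM = re.compile(r"^\s*[*-]\s+(?P<key>[^-]+?)\s+-\s+(?P<value>.+?)\s*$")


def parse_list_output(text):
    """Turn 'foo:' headers followed by '* key - value' bullets into nested dicts."""
    result = {}
    current = None
    for line in text.splitlines():
        item = ITEM.match(line)
        if item and current is not None:
            result[current][item["key"]] = item["value"]
        elif line.strip().endswith(":"):
            # Any line ending in ":" starts a new section.
            current = line.strip()[:-1]
            result[current] = {}
    return result
```

Note that any line ending in a colon is treated as a section header, so a chatty preamble such as "Here is the data:" would show up as an empty section and may need filtering.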
You're aware ChatGPT4 has a JSON-only mode?
Some ideas in this forum thread on the same topic: https://genai.stackexchange.com/questions/202/how-to-generat...
OpenAI, Claude, Mistral Large and all models that you can run with Ollama have JSON-only modes!?