by robertclaus 819 days ago

"Setting up LLMs to output structured data is incredibly hard." resonated strongly with my experience working on similar one-off projects. I've almost always implemented some level of fuzzy matching to validate the LLM output and convert it back into my expected structured format.
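
As a rough illustration of that kind of fuzzy matching, here is a minimal Python sketch built on the standard library's `difflib`. The field names in `EXPECTED_KEYS`, the `PRIORITIES` values, and the 0.6 cutoff are assumptions made up for this example, not anything taken from a real project.

```python
import difflib
import json

# Hypothetical schema for illustration: allowed field names and the allowed
# values of one enumerated field.
EXPECTED_KEYS = ["description", "when", "priority"]
PRIORITIES = ["low", "medium", "high"]


def closest(value, choices, cutoff=0.6):
    """Return the choice most similar to value, or None if nothing is close."""
    candidate = str(value).strip().lower()
    matches = difflib.get_close_matches(candidate, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def normalize(raw_json):
    """Map near-miss keys and enum values onto the expected schema."""
    data = json.loads(raw_json)
    result = {}
    for key, value in data.items():
        canonical = closest(key, EXPECTED_KEYS)
        if canonical is None:
            continue  # drop keys that match nothing in the schema
        result[canonical] = value
    if "priority" in result:
        # Becomes None when the value is not close to any allowed priority.
        result["priority"] = closest(result["priority"], PRIORITIES)
    return result
```

With this sketch, a key such as `"whenn"` is mapped to `"when"`, and an unrelated key is dropped. The cutoff probably needs tuning per project, since a loose cutoff can silently map a wrong key onto a valid one.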

I've also noticed that the LLMs are much better at writing code than structured JSON (no real surprise given the popularity of code assistants). If it makes sense in the specific situation, I now have the LLM generate code and parse it into the right structure rather than requesting structured data directly:

`generate_event("I need to do X", new Date("1-1-2025"))` seems to be more reliable to generate than `{ "description": "I need to do X", "when": "1-1-2025" }`

7 comments

BoorishBears 819 days ago

I really don't get what people are doing wrong here.

I have a 7000 token prompt that generates JSON chugging away in production and at scale I'm seeing ~1 in 4000 generations require a re-generation, and even that could probably be killed with some basic "healing" code.

OSS models are prone to outputting garbage in my experience, but OP mentions ChatGPT:

How are you running into issues if you simply prefill the response with `` ```json `` and set `` ``` `` as your stop token?

Also, are people just not looking for the opening and closing braces, and treating the output as broken if there's a preamble? The prefill gets rid of the preamble, but if you're not willing or able to prefill, how hard is it to get JSON out of a string?
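
For what it's worth, pulling JSON out of a reply with a preamble can be sketched in a few lines of Python. This is one possible approach, not necessarily what anyone in this thread runs in production: it scans for an opening brace and lets `json.JSONDecoder.raw_decode` ignore whatever trails the object. The `generate` callable in the retry helper is a hypothetical stand-in for whatever function calls your model.

```python
import json


def extract_json(text):
    """Return the first JSON object found in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value at the start of the string and
            # ignores anything after it (closing fence, trailing commentary).
            value, _end = decoder.raw_decode(text[start:])
            return value
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    return None


def generate_json(generate, max_attempts=3):
    """Call generate() until its output contains a parseable JSON object.

    generate is a hypothetical zero-argument callable returning the raw
    model output as a string.
    """
    for _ in range(max_attempts):
        value = extract_json(generate())
        if value is not None:
            return value
    raise ValueError(f"no valid JSON after {max_attempts} attempts")
```

The retry loop corresponds to the occasional re-generation mentioned above. It does not attempt any "healing" of malformed JSON.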

refulgentis 819 days ago

If you're doing it locally, it's likely got llama.cpp underneath it somewhere. Ask the dev to allow specifying a JSON schema via its grammar feature.

el_nahual 819 days ago

As long as you can sanitize the LLM output somehow. You should never `eval` LLM code straight from the tap!

dns_snek 819 days ago

You shouldn't sanitize. If you're taking the approach described above, you should run the generated code inside a minimal interpreter that doesn't implement any potentially dangerous APIs.
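
A minimal sketch of such an interpreter in Python, assuming the LLM is asked to emit a single Python-style call (rather than the JavaScript-style `new Date(...)` shown above). The `generate_event` function and the `date` helper are hypothetical names for this example. The code walks the syntax tree from `ast.parse` and never calls `eval` or `exec`.

```python
import ast
from datetime import date


def generate_event(description, when):
    """Hypothetical target function: builds the structured record."""
    return {"description": description, "when": when.isoformat()}


# The only names the generated code may call. Anything else is rejected.
ALLOWED_CALLS = {
    "generate_event": generate_event,
    "date": date.fromisoformat,
}


def evaluate(node):
    """Evaluate a tiny subset of Python: plain literals and allowed calls."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (str, int, float, bool)):
        return node.value
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        func = ALLOWED_CALLS.get(node.func.id)
        if func is None:
            raise ValueError(f"call to {node.func.id!r} is not allowed")
        if node.keywords:
            raise ValueError("keyword arguments are not supported")
        return func(*[evaluate(arg) for arg in node.args])
    # Attribute access, imports, operators, and so on all end up here.
    raise ValueError(f"unsupported syntax: {type(node).__name__}")


def run_llm_call(source):
    """Parse one expression and interpret it without eval or exec."""
    tree = ast.parse(source.strip(), mode="eval")
    return evaluate(tree.body)
```

Here `run_llm_call('generate_event("I need to do X", date("2025-01-01"))')` builds the record, while something like `__import__("os").system(...)` raises `ValueError` because attribute access and unknown names are simply not implemented.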

Mathnerd314 819 days ago

I think it's the training data: there is not a lot of JSON in it. It's much easier to get the model to generate list-style data, like "foo:\n* prop1 - val1\n* prop2 - val2", or similar formats, as the models seem to have seen a lot of that sort of data.
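
Converting that list-style output back into a structure is not much code either. A rough Python sketch, assuming exactly the layout above (a `name:` header line followed by `* key - value` bullets); the function name is made up for this example.

```python
def parse_list_format(text):
    """Parse 'name:' headers followed by '* key - value' bullets into dicts."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("* "):
            if current is None:
                continue  # bullet before any header: ignore it
            key, sep, value = line[2:].partition(" - ")
            if sep:  # skip bullets that lack the ' - ' separator
                records[current][key.strip()] = value.strip()
        elif line.endswith(":"):
            current = line[:-1].strip()
            records[current] = {}
    return records
```

Lines that fit neither pattern are skipped, so a chatty preamble or closing remark does not break the parse.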

bboygravity 819 days ago

You're aware ChatGPT4 has a JSON-only mode?

nosefurhairdo 819 days ago

Some ideas in this forum thread on the same topic: https://genai.stackexchange.com/questions/202/how-to-generat...

Zetobal 819 days ago

OpenAI, Claude, Mistral Large, and all models that you can run with Ollama have JSON-only modes!?
