HN Mirror



	by visarga 964 days ago
	Nowadays I am more interested in a "forgiving" JSON/YAML parser that would recover from LLM errors. Is there such a thing?

5 comments

explaininjs 963 days ago

Perhaps not quite what you're asking for, but along the same lines there's this "Incomplete JSON" parser, which takes a string of JSON as it's coming out of an LLM and parses it into as much data as it can get. It's useful for building streaming UIs; for instance, it is used quite extensively on https://rexipie.com.

https://gist.github.com/JacksonKearl/6778c02bf85495d1e39291c...

Some example test cases:

    { input: '[{"a": 0, "b":', output: [{ a: 0 }] },
    { input: '[{"a": 0, "b": 1', output: [{ a: 0, b: 1 }] },

    { input: "[{},", output: [{}] },
    { input: "[{},1", output: [{}, 1] },
    { input: '[{},"', output: [{}, ""] },
    { input: '[{},"abc', output: [{}, "abc"] },

Work could be done to optimize it, for instance by adding streaming support. But the cycles consumed either way are minimal for LLM-output-length-constrained JSON.
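For readers who want a feel for the approach, here is a rough sketch of a truncation-tolerant parser. It is not the code from the gist; `parsePartialJson` and the `Json` type are names made up for illustration, and it assumes the input is a prefix of otherwise valid JSON (it does not try to repair other kinds of errors). A half-written string is kept, a half-written key or literal is dropped, and a half-written number is kept as far as it parses.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Hypothetical helper, not the gist's code: parse as much of a truncated JSON
// document as possible. Returns undefined when nothing usable has arrived yet.
function parsePartialJson(text: string): Json | undefined {
  let i = 0;
  const literals: [string, Json][] = [["true", true], ["false", false], ["null", null]];

  const skipWs = (): void => {
    while (i < text.length && " \t\n\r".indexOf(text[i]) !== -1) i++;
  };
  // Any dead end is treated as "the input stops here".
  const giveUp = (): undefined => {
    i = text.length;
    return undefined;
  };

  const parseString = (): string => {
    const start = i++; // skip the opening quote
    let safeEnd = i; // end of the last complete character or escape
    while (i < text.length && text[i] !== '"') {
      if (text[i] === "\\") i += text[i + 1] === "u" ? 6 : 2;
      else i++;
      if (i <= text.length) safeEnd = i;
    }
    if (i < text.length) return JSON.parse(text.slice(start, ++i)); // closed string
    i = text.length;
    return JSON.parse(text.slice(start, safeEnd) + '"'); // supply the missing quote
  };

  const parseNumber = (): number | undefined => {
    const start = i;
    while (i < text.length && "+-0123456789.eE".indexOf(text[i]) !== -1) i++;
    let raw = text.slice(start, i);
    // Trim a dangling "e", sign, etc. until what is left is a number.
    while (raw !== "" && isNaN(Number(raw))) raw = raw.slice(0, -1);
    return raw === "" ? giveUp() : Number(raw);
  };

  const parseValue = (): Json | undefined => {
    skipWs();
    const c = text[i];
    if (c === "{") return parseObject();
    if (c === "[") return parseArray();
    if (c === '"') return parseString();
    if (c === "-" || (c >= "0" && c <= "9")) return parseNumber();
    for (const [word, value] of literals) {
      if (text.slice(i, i + word.length) === word) {
        i += word.length;
        return value;
      }
    }
    return giveUp(); // truncated literal, unexpected character, or end of input
  };

  const parseArray = (): Json[] => {
    i++; // skip "["
    const out: Json[] = [];
    for (;;) {
      skipWs();
      if (i >= text.length) return out;
      if (text[i] === "]") { i++; return out; }
      if (text[i] === ",") { i++; continue; }
      const value = parseValue();
      if (value === undefined) return out;
      out.push(value);
    }
  };

  const parseObject = (): { [key: string]: Json } => {
    i++; // skip "{"
    const out: { [key: string]: Json } = {};
    for (;;) {
      skipWs();
      if (i >= text.length) return out;
      if (text[i] === "}") { i++; return out; }
      if (text[i] === ",") { i++; continue; }
      if (text[i] !== '"') { giveUp(); return out; }
      const key = parseString();
      skipWs();
      if (text[i] !== ":") { giveUp(); return out; } // key without a value yet
      i++;
      const value = parseValue();
      if (value === undefined) return out;
      out[key] = value;
    }
  };

  return parseValue();
}
```

On the test cases quoted above, `parsePartialJson('[{"a": 0, "b":')` gives `[{ a: 0 }]` and `parsePartialJson('[{},"abc')` gives `[{}, "abc"]`. Re-parsing the whole buffer on every new chunk is quadratic in the worst case, which matches the point that this hardly matters at LLM output sizes.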

Fun fact: as best I can tell, GPT-4 is entirely unable to synthesize code to accomplish this task. Perhaps that will change as this implementation is made public; I do not know.


kevingadd 963 days ago

If the LLM did such a bad job that the syntax is wrong, do you really trust the data inside?

Forgiving parsers/lexers are common in compilers for languages like Rust, C#, or TypeScript. You may want to investigate TypeScript in particular, since JSON is essentially a subset of its syntax. Maybe you could repurpose its parser.


RichieAHB 963 days ago

I feel like trying to infer valid JSON from invalid JSON is a recipe for garbage. You’d probably be better off running the “JSON” through the LLM for a second pass but, as the sibling commenter said, at this point even the good JSON may be garbage…


gurrasson 963 days ago

The jsonrepair tool https://github.com/josdejong/jsonrepair might interest you. It's tailored to fix JSON strings.
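To give a flavor of what "repairing" means here, below is a small hand-rolled sketch of one such fix-up: dropping trailing commas, a mistake LLMs make fairly often. This is not jsonrepair's implementation or API; `stripTrailingCommas` is a hypothetical helper, and a real repair tool handles many more cases than this one.

```typescript
// Hypothetical helper: remove commas that sit directly before a closing
// } or ], while leaving commas inside string values untouched.
function stripTrailingCommas(text: string): string {
  let out = "";
  let inString = false;
  let escaped = false;
  for (let i = 0; i < text.length; i++) {
    const c = text[i];
    if (inString) {
      out += c;
      if (escaped) escaped = false;
      else if (c === "\\") escaped = true;
      else if (c === '"') inString = false;
      continue;
    }
    if (c === '"') inString = true;
    if (c === ",") {
      // Look past whitespace: a comma followed by } or ] is a trailing comma.
      let j = i + 1;
      while (j < text.length && " \t\n\r".indexOf(text[j]) !== -1) j++;
      if (text[j] === "}" || text[j] === "]") continue; // drop it
    }
    out += c;
  }
  return out;
}

// The cleaned text can then go through the ordinary strict parser.
const repaired = JSON.parse(stripTrailingCommas('{"a": [1, 2, ], "b": "x,}", }'));
```

The string-state tracking is the part that matters: a naive regex replace would also eat the comma inside `"x,}"`.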

I've been looking into something similar for handling partial JSON, where you only have the first n characters of a JSON document. This is common with LLMs whose outputs are streamed to reduce latency. If one knows the JSON schema ahead of time, one can start processing the first fields before the remaining data has fully loaded. If you have to wait for the whole thing to load, there is little point in streaming.

Was looking for a library that could do this parsing.
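In the meantime, the "process the first fields early" idea can be sketched without a full partial parser, assuming the streamed document is a single top-level JSON object. The names `createFieldStreamer` and `onField` are made up for this example. The trick is to track nesting depth and string state while scanning; each time a comma appears at depth 1, everything before it plus a closing `}` is valid JSON, so the strict `JSON.parse` can be used on that prefix.

```typescript
type FieldHandler = (key: string, value: unknown) => void;

// Hypothetical helper: returns a push(chunk) function. Each top-level field
// is reported once, as soon as its value has fully arrived.
function createFieldStreamer(onField: FieldHandler): (chunk: string) => void {
  let buffer = "";
  let pos = 0; // next character to scan
  let depth = 0; // current nesting depth of {} and []
  let inString = false;
  let escaped = false;
  const seen: string[] = []; // keys already reported

  const flush = (end: number, closer: string): void => {
    // The prefix up to `end`, plus the closer, is a complete JSON object.
    const obj = JSON.parse(buffer.slice(0, end) + closer);
    for (const key of Object.keys(obj)) {
      if (seen.indexOf(key) === -1) {
        seen.push(key);
        onField(key, obj[key]);
      }
    }
  };

  return (chunk: string): void => {
    buffer += chunk;
    for (; pos < buffer.length; pos++) {
      const c = buffer[pos];
      if (inString) {
        if (escaped) escaped = false;
        else if (c === "\\") escaped = true;
        else if (c === '"') inString = false;
      } else if (c === '"') {
        inString = true;
      } else if (c === "{" || c === "[") {
        depth++;
      } else if (c === "}" || c === "]") {
        depth--;
        if (depth === 0) flush(pos + 1, ""); // the whole object is complete
      } else if (c === "," && depth === 1) {
        flush(pos, "}"); // a top-level field just ended
      }
    }
  };
}
```

Feeding it the chunks `{"title": "So`, then `up", "tags": ["a", `, then `"b"], "n": 3}` reports `title` during the second chunk and `tags` and `n` during the third. It re-parses the prefix on every flush, which is wasteful but simple; a field is only reported once its value is complete, so there is no half-written data to second-guess.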


explaininjs 963 days ago

See my sibling comment :)


_dain_ 963 days ago

halloween was last week
