by madethemcry
262 days ago
That was a great read, thank you. I have a related observation. In my experience, the rate of hallucinated URLs with structured output (think of a field `url` or `link`) is pretty high, especially compared to the alternative approach where you let the LLM generate free text and then use a second LLM to convert that text into the desired structured format. With structured output, the LLM is forced to answer in a very specific way, so if there is no URL for the given field, it makes one up. Here is a related quote from the article:
> Structured outputs builds on top of sampling by constraining the model's output to a specific format.
E.g., if the LLM hallucinates nonexistent URLs, you may add a boolean `contains_url` field to your entity's JSON schema, placing it before the URL field itself. This way, the URL extraction is split into two simpler steps: checking whether the URL is there, and then actually extracting it. If the URL is missing, the `"contains_url": false` already in the context will strongly urge the LLM to output an empty string there.
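A minimal sketch of that idea, assuming a generator that emits fields in the order the schema declares them (worth checking for your provider). The `name` field, `ENTITY_SCHEMA`, and `clean_entity` are hypothetical names for illustration; only `contains_url` and the URL field come from the comment above. The cleanup step is an extra safety net in case the model still fills in a URL after answering `false`.

```python
import json

# Hypothetical schema for one extracted entity; field names are illustrative.
# "contains_url" is declared before "url" so that a generator emitting fields
# in declaration order commits to the boolean before writing the URL.
ENTITY_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "contains_url": {"type": "boolean"},
        "url": {"type": "string"},
    },
    "required": ["name", "contains_url", "url"],
    "additionalProperties": False,
}


def clean_entity(raw_json: str) -> dict:
    """Parse the model's JSON and blank the URL when the flag says there is none."""
    entity = json.loads(raw_json)
    if not entity.get("contains_url", False):
        entity["url"] = ""
    return entity
```

You would pass `ENTITY_SCHEMA` to whatever structured-output mechanism you use and run `clean_entity` on the response text.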
This also comes up with quantities a lot. Imagine you're trying to sort job adverts by salary range, which you extract via an LLM. Salaries may be expressed as monthly instead of annual (common in some countries), in different currencies, pre- or post-tax, etc.
Instead of having an `annual_pretax_salary_usd` field, which is what you actually want, but which the LLM is extremely ill-equipped to generate, have a detailed schema like `type: monthly|yearly, currency:str, low:float, high:float, tax: pre_tax|post_tax`.
That schema is much easier for an LLM to generate, and you can then convert it to a single number via straight code.
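Here is a rough sketch of that conversion step. The dataclass mirrors the schema above; everything else is an assumption for illustration: the exchange rates and the flat tax rate are placeholder numbers (not real figures), and collapsing the range to its midpoint is just one possible choice for a sort key.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class Salary:
    """What the LLM is asked to extract, field for field."""
    type: Literal["monthly", "yearly"]
    currency: str
    low: float
    high: float
    tax: Literal["pre_tax", "post_tax"]


# Placeholder values for illustration only, NOT real exchange or tax rates.
USD_PER_UNIT = {"USD": 1.0, "EUR": 1.1}
ASSUMED_FLAT_TAX_RATE = 0.25


def annual_pretax_salary_usd(s: Salary) -> float:
    """Collapse the extracted range into one sortable number."""
    amount = (s.low + s.high) / 2          # midpoint of the advertised range
    if s.type == "monthly":
        amount *= 12                       # annualize
    if s.tax == "post_tax":
        amount /= 1 - ASSUMED_FLAT_TAX_RATE  # crude gross-up to pre-tax
    return amount * USD_PER_UNIT[s.currency]
```

Sorting is then just `sorted(adverts, key=annual_pretax_salary_usd)`, and the messy domain logic (real tax tables, live exchange rates) lives in code you can test rather than in the prompt.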