by ajcp
542 days ago
We have been using Gemini 1.5 Flash at enterprise scale on a massively varied, form-based document dataset, and we have yet to see hallucinations, either on our ground-truth dataset or in our random audits for evaluation. Just to make sure, though, we threw some recursion at it: we take the output and give it right back to the model along with the original prompt, and ask it how accurate the output is. If it thinks it's not accurate, we tell it to rewrite the original prompt so that it produces a more accurate output. We then stuff that right back down its own function :) For your use case it would be much easier, as all you'd need to do is provide Gemini with your "zones" as the JSON schema for the output, and it will quite reliably identify them.
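For anyone wanting to try the self-check loop described above, here is a minimal sketch. It is not our production code: `model` is a hypothetical stand-in for whatever function sends text to the LLM and returns its reply, and the prompt wording, the `ACCURATE`/`INACCURATE` convention, and the `max_rounds` cap are all assumptions made for illustration.

```python
from typing import Callable, Tuple

# Hypothetical: any function that sends text to an LLM and returns its reply.
Model = Callable[[str], str]


def self_check_loop(
    model: Model, prompt: str, document: str, max_rounds: int = 3
) -> Tuple[str, str]:
    """Run the prompt, ask the model to grade its own output, and let it
    rewrite the prompt when it judges the output inaccurate."""
    output = model(f"{prompt}\n\n{document}")
    for _ in range(max_rounds):
        # Hand the model its own prompt and output and ask for a verdict.
        verdict = model(
            "Review this extraction. Reply ACCURATE or INACCURATE.\n\n"
            f"PROMPT:\n{prompt}\n\nDOCUMENT:\n{document}\n\nOUTPUT:\n{output}"
        )
        if verdict.strip().upper().startswith("ACCURATE"):
            break
        # Judged inaccurate: ask for a better prompt, then run it again.
        prompt = model(
            "Rewrite this prompt so it yields a more accurate output. "
            "Reply with the new prompt only.\n\n"
            f"PROMPT:\n{prompt}\n\nOUTPUT:\n{output}"
        ).strip()
        output = model(f"{prompt}\n\n{document}")
    return prompt, output
```

The cap on rounds is there so a model that never declares itself accurate cannot loop forever; the function returns the last prompt together with the output it produced.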
OCR seems good, on par with Google Vision.
But the footnotes are not properly identified on most pages. They are identified correctly when there is a large gap and the first line of the footnotes starts with a number. When the footnote block starts with text (continuing a footnote from the previous page) and/or the gap is small or almost non-existent, it fails: all text on the page is treated as main text.
But the main problem isn't even that; it's that it takes between 10 and 20 seconds per page. That works out to between roughly 1.7 and 3.3 hours per 600-page volume. Google Vision takes less than one second per page.
It's possible there is a setup cost and that doing batches or even full PDFs would be better, though. Do you have experience with this? And can you maybe share "prompt secrets" that would improve the results...?
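For reference, the timing estimate above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes pages are processed strictly one after another with no batching or parallelism, and `volume_hours` is a helper name made up for this example.

```python
def volume_hours(pages: int, seconds_per_page: float) -> float:
    """Total sequential processing time for one volume, in hours."""
    return pages * seconds_per_page / 3600


PAGES = 600  # pages per volume, as quoted above

for label, seconds in [
    ("LLM, 10 s/page", 10),
    ("LLM, 20 s/page", 20),
    ("Google Vision, 1 s/page upper bound", 1),
]:
    print(f"{label}: {volume_hours(PAGES, seconds):.2f} h")
```

At one second per page or less, the same volume takes at most about ten minutes, so the gap is roughly a factor of 10 to 20 unless batching removes a fixed per-request cost.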