Hacker News
by tlofreso 589 days ago
"accurate document extraction is becoming a commodity with powerful VLMs"

Agree.

The capability is fairly trivial for orgs with decent technical talent. The tech / processes all look similar:

User uploads file --> Azure prebuilt-layout returns .MD --> prompt + .MD + schema sent to LLM --> JSON returned. Do whatever you want with it.
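For illustration, a minimal sketch of that flow in Python. The names `layout_to_markdown` and `call_llm` are hypothetical stand-ins for the layout/OCR call and the LLM call; they are injected as plain functions here and are not real SDK signatures. The required-key check is only a cheap sanity check, not full JSON Schema validation.

```python
import json
from typing import Callable


def build_prompt(instructions: str, markdown: str, schema: dict) -> str:
    """Combine the instructions, the target schema, and the document text."""
    return (
        f"{instructions}\n\n"
        f"Return only JSON matching this schema:\n{json.dumps(schema, indent=2)}\n\n"
        f"Document:\n{markdown}"
    )


def extract(
    file_bytes: bytes,
    schema: dict,
    layout_to_markdown: Callable[[bytes], str],  # hypothetical: wraps the layout/OCR service
    call_llm: Callable[[str], str],              # hypothetical: wraps the LLM provider
    instructions: str = "Extract the fields below from the document.",
) -> dict:
    # Step 1: file -> Markdown via the layout/OCR step.
    markdown = layout_to_markdown(file_bytes)
    # Step 2: prompt + Markdown + schema -> LLM.
    raw = call_llm(build_prompt(instructions, markdown, schema))
    # Step 3: parse the JSON and check that the required keys are present.
    data = json.loads(raw)
    missing = [key for key in schema.get("required", []) if key not in data]
    if missing:
        raise ValueError(f"LLM output is missing required keys: {missing}")
    return data
```

Passing the two service calls in as functions keeps the sketch independent of any particular vendor and makes it easy to test with fakes.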

2 comments

Totally agree that this is becoming the standard "reference architecture" for this kind of pipeline. The only thing that complicates this a lot today is complex inputs. For simple 1-2 page PDFs, what you describe works quite well out of the box, but for 100+ page docs it starts to fall over in ways I described in another comment.
Are really large inputs solved at midship? If so, I'd consider that a differentiator (at least today). The demo's limited to 15pgs, and I don't see any marketing around long-context or complex inputs on the site.

I suspect this problem gets solved in the next iteration or two of commodity models. In the meantime, being smart about how the context gets divvied works ok.
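One way to read "divvying up the context", as a rough sketch: greedily pack pages or sections into chunks under a size budget, run the extraction once per chunk, then merge the per-chunk JSON. The function names, the character-based budget, and the "first non-null value wins" merge rule are all assumptions for illustration, not anything described in this thread.

```python
def pack_sections(sections: list[str], max_chars: int) -> list[str]:
    """Greedily group consecutive sections into chunks of roughly max_chars.

    Separator characters are not counted, and a single section longer than
    max_chars is kept whole in its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for section in sections:
        # Flush the current chunk if adding this section would exceed the budget.
        if current and size + len(section) > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(section)
        size += len(section)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def merge_results(results: list[dict]) -> dict:
    """Merge per-chunk extractions; the first non-null value for a key wins."""
    merged: dict = {}
    for result in results:
        for key, value in result.items():
            if value is not None and merged.get(key) is None:
                merged[key] = value
    return merged
```

A real pipeline would likely budget in tokens rather than characters and need a smarter merge for fields that legitimately span chunks, such as line items in a table that crosses pages.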

I do like the UI you appear to have for citing information: drawing polygons around the extracted data and showing where it appears in the PDF. Nice.

Why all those steps? Why not just file + prompt to JSON directly?
Having the text (for now) is still pretty important for quality output. The vision models are quite good, but not a replacement for a quality OCR step. A combination of Text + Vision is compelling too.