Hacker News
by adit_a 593 days ago
Part of the goal of releasing the dataset is to highlight how hard PDF parsing can be. Reducto models are SOTA, but they aren't perfect.

We constantly see alternatives show one ideal table to claim they're accurate. Being able to parse some tables is not hard.

What happens when a table has merged cells, dense text, rotations, or no gridlines? Will your table outputs be the same when a user uploads the same document twice?

Our team is relentlessly focused on solving for the true range of scenarios so our customers don't have to. Excited to share more about our next-gen models soon.