|
We've been building table extraction at Pulse and evaluated four benchmarks: OmniDocBench, SCORE-Bench, ParseBench, and RD-TableBench. None of them fully reflects the enterprise document workflows we've encountered in production. TEDS (used by OmniDocBench) penalizes HTML formatting differences that don't affect the actual table, so the same 3x3 grid scores differently depending on whether the header row is wrapped in <thead> or written as a plain <tr>. The benchmark also covers only English and Chinese plus a small mixed category. SCORE-Bench's spatial tolerance parameter can mask real failures: if you drop a header row and shift all data up by one, then with delta=1 the benchmark still reports high accuracy even though the column labels are gone. ParseBench generates its ground truth with frontier VLMs (Claude Opus for tables), which introduces hallucination risk, and its TableRecordMatch metric treats tables as unordered bags of key-value records, so it doesn't penalize column transposition or row reordering. Its table set is also just 503 pages, English-only, with over half from a single source. RD-TableBench linearizes tables into 1D sequences, which loses the distinction between horizontal and vertical adjacency.
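To make the TEDS point concrete, here is a minimal sketch of what comparing the parsed grid rather than the markup looks like. It is not TEDS, T-LAG, or any benchmark's actual code: `GridParser` and `to_grid` are hypothetical names, only Python's standard `html.parser` is used, and colspan/rowspan are ignored for brevity.

```python
from html.parser import HTMLParser


class GridParser(HTMLParser):
    """Collect cell text row by row, ignoring wrappers like <thead>/<tbody>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # list of rows, each a list of cell strings
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            # <th> and <td> are treated the same: both are just cells.
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def to_grid(html):
    """Return the table as a list of rows of cell text (hypothetical helper)."""
    parser = GridParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Two encodings of the same 2x2 table: one with <thead>/<th>, one with plain rows.
with_thead = (
    "<table><thead><tr><th>A</th><th>B</th></tr></thead>"
    "<tbody><tr><td>1</td><td>2</td></tr></tbody></table>"
)
plain_rows = "<table><tr><td>A</td><td>B</td></tr><tr><td>1</td><td>2</td></tr></table>"

same_grid = to_grid(with_thead) == to_grid(plain_rows)
```

Both strings reduce to the same grid, so a metric computed on the grid cannot tell them apart, whereas a tree-edit distance over the DOM sees extra nodes in the first one.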
The RD-TableBench ground truth audit is what concerned us most. We checked all 1,000 ground truth files against the source images and found scrambled text and wrong structure, garbled OCR on CJK and Arabic, and buffer artifacts where random digit sequences were appended to real numeric values. Dozens of ground truth files are byte-for-byte identical to one provider's output. In a subset of the error cases, the ground truth and that provider share the exact same error (same wrong word order in headers, same watermark text pulled into cells, same garbled CJK characters), while independent providers don't produce those errors. These findings motivated us to build PulseBench-Tab, a benchmark of 1,820 human-annotated tables across 9 languages and 4 scripts, with graph-based evaluation via T-LAG that operates on the parsed grid rather than the DOM tree, and fully open ground truth, scoring code, and provider outputs. Arabic and Korean both show 75+ point spreads across providers. Everything is available on HuggingFace and GitHub. |
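The byte-for-byte check from the audit can be sketched roughly as follows. This is not the script we ran: the directory layout (one file per table, same filename in both folders), the `.html` extension, and the names `sha256_of` and `identical_files` are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash the raw bytes of a file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def identical_files(gt_dir, provider_dir):
    """Return sorted names of files whose bytes match in both directories.

    Assumes (hypothetically) that ground truth and provider output use the
    same filename for the same table.
    """
    gt_dir, provider_dir = Path(gt_dir), Path(provider_dir)
    matches = []
    for gt_file in sorted(gt_dir.iterdir()):
        candidate = provider_dir / gt_file.name
        if (
            gt_file.is_file()
            and candidate.is_file()
            and sha256_of(gt_file) == sha256_of(candidate)
        ):
            matches.append(gt_file.name)
    return matches


# Tiny self-contained demo with made-up files: t1 matches, t2 does not.
with tempfile.TemporaryDirectory() as tmp:
    gt = Path(tmp, "ground_truth")
    prov = Path(tmp, "provider_a")
    gt.mkdir()
    prov.mkdir()
    (gt / "t1.html").write_text("<table><tr><td>1</td></tr></table>")
    (prov / "t1.html").write_text("<table><tr><td>1</td></tr></table>")
    (gt / "t2.html").write_text("<table><tr><td>2</td></tr></table>")
    (prov / "t2.html").write_text("<table><tr><td>two</td></tr></table>")
    demo_matches = identical_files(gt, prov)
```

An exact hash match is a deliberately strict test: it only flags files that are identical down to whitespace, so it undercounts cases where ground truth was derived from a provider's output and then lightly edited.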