by jonathan-adly 557 days ago

So, the synthetic QA datasets in ViDoRe are exactly like that: 90% text, 10% charts/tables. OCR + BM25 is at ~90% NDCG@5, which is pretty decent. ColPali/ours is at ~98%.
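For anyone unfamiliar with the metric, here is a minimal sketch of how NDCG@5 is typically computed for one query. It is not the benchmark's own evaluation code. The function names are hypothetical, and it assumes `relevances` holds the graded relevance of every judged page in ranked order, so the ideal ordering can be derived from the same list.

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain: each hit is discounted by log2 of its 1-based rank + 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=5):
    # Normalize by the DCG of the best possible ordering of the same judgments.
    # Assumption: `relevances` covers all judged pages for the query, in ranked order.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# One relevant page per query, as in a typical page-retrieval QA setup.
print(ndcg_at_k([1, 0, 0, 0, 0]))     # relevant page ranked first
print(ndcg_at_k([0, 1, 0, 0, 0]))     # relevant page ranked second
print(ndcg_at_k([0, 0, 0, 0, 0, 1]))  # relevant page outside the top 5
```

With a single relevant page per query, the score only drops when that page slips down the ranking or out of the top 5, so the gap between ~90% and ~98% comes down to a fairly small number of queries.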

It is a small upgrade, but an upgrade nonetheless. The complexity and cost of multi-vectors *might* not make it worth it; it really depends on how accuracy-critical the task is.
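To make the multi-vector cost concrete, here is a rough sketch of late-interaction (MaxSim) scoring, the style of scoring ColPali-type models use, next to a simple storage count. The function names are hypothetical and the vector counts in the example are illustrative assumptions, not measurements from any particular model or from our system.

```python
def dot(a, b):
    # Plain dot product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_vecs, doc_vecs):
    # Late interaction: for each query vector, keep its best match among the
    # document vectors, then sum those best matches over the query.
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

def floats_per_page(num_vectors, dim):
    # Storage per page, counted in floats (before any quantization or pooling).
    return num_vectors * dim

query = [[1.0, 0.0], [0.0, 1.0]]
page = [[1.0, 0.0], [0.5, 0.5]]
print(maxsim_score(query, page))

# Illustrative sizes only: one dense vector vs. many patch vectors per page.
print(floats_per_page(1, 1024), floats_per_page(1000, 128))
```

Scoring work grows with (query vectors × page vectors) per candidate page, and storage grows with the number of vectors kept per page, which is where the extra complexity and cost come from.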

For example, one of our customers does this over FDA monographs, which are like 95%+ text and 5% tables. With text-based pipelines the misses were extremely painful, even though there weren't that many. So the migration made sense for them.