1. Uses half-vecs, so you cut down everything by half with no recall loss
2. Uses token pooling with hierarchial clustering at 3, so, you further cut down things by 2/3rd with <1% loss
3. Everything is on Postgres and pgvector, so you can do all the Postgres stuff and decrease corpus size by document metadata filtering
4. We have a 5000+ pages corpus in production with <3 seconds latency.
5. We benchmark against the Vidore leaderboard, and very near SOTA
I really like the idea of ColPali and products building on it but I am still unsure about the applications for which it makes most sense. We mostly deal with reports that are 80-90% text, 10-20% figures and tables. Does a vision first approach makes sense in this context? My sense is that text-based embeddings are better in mostly text contexts. Layout, for example, is pretty much irrelevant but plays into vision-based approaches. What is your sense about this?
So - the synthetic QAs datasets in the Vidore datasets are exactly like that 90% text, 10% charts/tables. OCR + BM25 is at ~90% NCDG@5 which is pretty decent. ColPali/Ours is at ~98%.
It is a small upgrade, but one nonetheless. The complexity, and the cost of multi-vectors *might* not make this worth it, really depends on how accuracy-critical the task is.
For example, one of our customers who does this over FDA monographs, which is like 95%+ text, and 5% tables - they misses were extremely painful - even though there weren't that many in text-based pipelines. So, the migrations made sense to them.