

by fzysingularity 480 days ago

Saw your benchmark, looks great. Will run our models against it and share some of our learnings.

As you mentioned, there are a few caveats to VLMs that folks are typically unaware of (not at all exhaustive, but the ones you highlighted):

1. Long-form text (dense): Output token limits of 4K/8K mean that the transcription of a dense page may exceed what the LLM can emit in one response. It takes some careful work to make VLMs handle these pages as seamlessly as OCR does.

2. Visual grounding (a.k.a. bounding boxes) is one of those things that VLMs aren't natively good at, partly because the cross-entropy losses used in training aren't really geared for bounding box regression. We're making some strides here [1], so you should get an experience that is almost as good as native bounding box regression, all within the same VLM.
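
To make the cross-entropy point in (2) concrete, here's a rough sketch (not our actual training code). It assumes box coordinates are quantized into token bins, which is one common way to have a VLM emit boxes; `NUM_BINS`, the confidence value and both helper functions are made up for illustration. Token-level cross-entropy penalizes a one-bin miss exactly as much as a 400-bin miss, while a regression-style L1 error scales with distance:

```python
import math

NUM_BINS = 1000  # hypothetical: coordinates quantized into 1000 token bins


def cross_entropy(pred_bin: int, true_bin: int, confidence: float = 0.9) -> float:
    """Cross-entropy against a one-hot target, for a toy prediction that puts
    `confidence` on pred_bin and spreads the rest uniformly over other bins."""
    rest = (1.0 - confidence) / (NUM_BINS - 1)
    p_true = confidence if pred_bin == true_bin else rest
    return -math.log(p_true)


def l1_error(pred_bin: int, true_bin: int) -> float:
    """Regression-style error: normalized distance between the two bins."""
    return abs(pred_bin - true_bin) / NUM_BINS


true_bin = 500
near_miss = 501  # off by one bin
far_miss = 900   # off by 400 bins

# Cross-entropy only looks at the probability on the correct bin,
# so both misses cost the same.
print(cross_entropy(near_miss, true_bin), cross_entropy(far_miss, true_bin))

# L1 error grows with how far off the prediction is.
print(l1_error(near_miss, true_bin), l1_error(far_miss, true_bin))
```

In this toy setup the loss gives the model no signal that "almost right" is better than "way off", which is the gap that box-specific losses are meant to close.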

[1] https://colab.research.google.com/github/vlm-run/vlmrun-cook...