Has anyone experimented with using VLMs to detect "marks"? I'm thinking of pen- or pencil-based markings like underlines, circles, and checkmarks. Can these models do it?
None of them do it well, in our experience. We had to write our own custom pipeline with a mixture of legacy CV approaches to handle this (AI contract analysis). We constantly benchmark every new multimodal model and VLM that comes out and are consistently disappointed.
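
For anyone wondering what a "legacy CV" step for marks can look like, here is a minimal sketch. It is purely illustrative and is not the pipeline mentioned above, which isn't described in this thread. It assumes the page has already been binarized (1 = ink, 0 = paper) and flags long horizontal runs of ink as underline candidates. The function name `find_horizontal_strokes` and the `min_len` threshold are hypothetical; a real pipeline would also need deskewing, printed-rule removal, and separate handling for circles and checkmarks.

```python
from typing import List, Tuple

# (row, start_col, end_col_exclusive) of one horizontal ink run
Run = Tuple[int, int, int]


def find_horizontal_strokes(binary: List[List[int]], min_len: int) -> List[Run]:
    """Return runs of consecutive ink pixels (value 1) at least min_len wide.

    This is a crude stand-in for a morphological opening with a wide,
    one-pixel-tall kernel: short runs (text strokes) are discarded and
    long runs (underline candidates) are kept.
    """
    runs: List[Run] = []
    for y, row in enumerate(binary):
        start = None
        # The trailing 0 is a sentinel that closes a run ending at the row edge.
        for x, px in enumerate(row + [0]):
            if px and start is None:
                start = x
            elif not px and start is not None:
                if x - start >= min_len:
                    runs.append((y, start, x))
                start = None
    return runs


# Toy 5x12 "page": short text-like blobs on row 1, a long underline on row 3.
page = [
    [0] * 12,
    [0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0],
    [0] * 12,
    [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    [0] * 12,
]

print(find_horizontal_strokes(page, min_len=8))  # [(3, 1, 11)]
```

The threshold does most of the work here: set `min_len` too low and ordinary text strokes get flagged, set it too high and short hand-drawn underlines are missed, so it would have to be tuned against the scan resolution.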