Does anyone have experience applying these models to rendered content (PDFs, webpages, etc.)? Seems like a really promising area of research for building LLM agents.
It doesn’t work well for screen-based content in general. On the most recent Latent Space pod, one of the authors of SAM2 said explicitly that this wasn’t a focus of theirs, as it’s not foundational in the research space.