by serjester 683 days ago
Does anyone have experience applying these models to rendered content (PDFs, webpages, etc.)? Seems like a really promising area of research for building LLM agents.
3 comments

Doesn’t work well for screen based content in general. One of the authors of SAM2 talked about this explicitly in the most recent latent space pod: it's not a focus of theirs, as it’s not foundational in the research space.
> Doesn’t work well for screen based content in general.

It's not perfect, but it works: https://github.com/OpenAdaptAI/OpenAdapt/pull/610

> the most recent latent space pod

Link: https://www.latent.space/p/sam2

We are using the Segment Anything Model at OpenAdapt for exactly this purpose: https://github.com/OpenAdaptAI/OpenAdapt/pull/610

It works surprisingly well despite the fact that the model was not trained on this type of data.