I would love to use this in a project if it could also caption embedded images to produce something for RAG...