With this LLM approach you can at least create your training data from the raw images with natural language.