Imagine porting this to a dedicated app that can access the context of the open window and the text on the screen, providing an almost real-time assistant for everything you do on screen.
Automatically take a screenshot and feed it to https://github.com/vikhyat/moondream or similar? Doable. But while very impressive, the results are a bit of mixed bag (some hallucinations)