Pretty cool idea!
I like the "passive" approach.
A few questions: does it automatically take screenshots every X seconds?
And which models does it run locally to analyze the images and do the audio transcriptions?
It currently takes pictures every 30 seconds and whenever you switch applications.
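For anyone curious what that trigger policy looks like in code, here is a rough sketch in Python. It is not the app's actual implementation: the `CaptureTrigger` class and its field names are hypothetical, and restarting the 30-second timer after an app switch is an assumption on my part.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaptureTrigger:
    """Decides when to capture: on a fixed interval or on an app switch."""

    interval_s: float = 30.0              # interval mentioned in the post
    last_capture: Optional[float] = None  # timestamp of the last capture
    last_app: Optional[str] = None        # app that was active on the last check

    def should_capture(self, now: float, active_app: str) -> bool:
        # An app switch is any change from the previously seen app.
        switched = self.last_app is not None and active_app != self.last_app
        # The timer is due on the first call or once the interval has elapsed.
        due = self.last_capture is None or now - self.last_capture >= self.interval_s
        self.last_app = active_app
        if switched or due:
            # Assumption: a switch-triggered capture also restarts the timer.
            self.last_capture = now
            return True
        return False
```

A polling loop would call `should_capture(time.monotonic(), current_app_name)` and take the screenshot whenever it returns `True`.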
I use https://huggingface.co/mlx-community/gemma-3-4b-it-qat-4bit for chat and image recognition, and Qwen/Qwen3-Embedding-0.6B-4bit and Qwen3-Reranker-0.6B-4bit for search-related features.
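An embedding model and a reranker are commonly combined as a two-stage search: shortlist by vector similarity, then rescore the shortlist. The sketch below shows that general pattern only. How this app actually wires the two models together is not stated here, and `embed` and `rerank_score` are hypothetical callables standing in for the real models.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    query: str,
    docs: List[str],
    embed: Callable[[str], Vector],             # hypothetical: text -> vector
    rerank_score: Callable[[str, str], float],  # hypothetical: (query, doc) -> score
    top_k: int = 20,
    top_n: int = 5,
) -> List[str]:
    q = embed(query)
    # Stage 1: cheap vector similarity over every document, keep the top_k.
    shortlist = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:top_k]
    # Stage 2: costlier pairwise scoring on the shortlist only, keep the top_n.
    return sorted(shortlist, key=lambda d: rerank_score(query, d), reverse=True)[:top_n]
```

In a real setup the document embeddings would be computed once and stored rather than recomputed per query; the sketch skips that to stay short.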
For voice I use Apple's SFSpeechRecognizer. I'm thinking of switching to an open-source model, but the application's memory footprint is already very high.