| So your requirements are: 1. Reliable speech to text from multiple, possibly identifiable, speakers 2. Long-term knowledge storage and retrieval. Speech to text is a solved problem AFAIK, with one caveat: only for a single speaker. You'd need to train a local AI to identify all these different voices reliably. No easy feat. Assuming you have done that, you have to feed that data into a vector database so you can retrieve it when you're talking to the AI. You can't use it to train the AI because that would be too expensive. But then you hit another roadblock: either you have very good querying capabilities for that database, so you're able to retrieve what matters and feed it into the prompt, or your context window is huge. The latter is expensive. Some commercial LLM products already implement some form of learning based on previous chats, so it might be doable from a cost perspective. I think you can't fit the necessary computing power into a wristband today. It needs to take care of speech to text (again, multiple speakers), upload all of that to some cloud, and do it for hours and hours non-stop. Maybe it could just be a smart microphone that uploads a constant stream of audio to the cloud for processing? A privacy nightmare that most likely no one is willing to touch. Would you have to ask permission from everyone in the room before you enter with your microphone? |
The OP asks for “a device that passively listens to your conversations”, so even if single-speaker speech to text is solved perfectly, that isn’t enough. (I wouldn’t know whether it is, but I have my suspicions, certainly for a device worn on the wrist, which can rotate, get covered by a sweater, etc.)
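
To make the retrieval step from the quoted comment concrete, here is a rough sketch of "store speaker-tagged transcript snippets, pull back the few that matter, and feed them into the prompt". Everything in it is an assumption for illustration: `TranscriptStore`, `embed`, and `build_prompt` are made-up names, and the bag-of-words "embedding" is a toy stand-in for a real embedding model and vector database. It also assumes the hard part (speaker-attributed transcription) has already happened.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real system would use a learned model."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TranscriptStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self):
        self.items = []  # list of (speaker, text, vector)

    def add(self, speaker, text):
        self.items.append((speaker, text, embed(text)))

    def search(self, query, k=2):
        """Return the k snippets most similar to the query."""
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[2]), reverse=True)
        return [(speaker, text) for speaker, text, _ in ranked[:k]]


def build_prompt(store, question, k=2):
    """Put only the retrieved snippets into the prompt, not the whole history."""
    context = "\n".join(f"{s}: {t}" for s, t in store.search(question, k))
    return f"Context:\n{context}\n\nQuestion: {question}"


store = TranscriptStore()
store.add("Alice", "The dentist appointment moved to Thursday at noon")
store.add("Bob", "I will bring the projector for the demo")
store.add("Alice", "Remember to renew the car insurance next week")

print(build_prompt(store, "when is the dentist appointment", k=1))
```

The point of the sketch is the trade-off the quoted comment describes: the smaller `k` is, the cheaper the prompt, but the more everything depends on the ranking in `search` actually surfacing the right snippet.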