
I think if the same LLM were trained on audio and video input instead of text, and produced audio output, including silence tokens, then the notion of time would get "built in". Audio continuation without translation to text has been shown to work, and mixing it with text is also possible. But all this would require a massive network that might be difficult even for the world's biggest companies to train and serve at any kind of scale. So it's more of an engineering problem than a theoretical one imho.

Also imho, until the context/memory problem is fully solved, we won't really see the AI as having any kind of agency. But continuous, low-latency interaction would certainly feel like a step towards that.