Hacker News
by profile53 920 days ago
At a purely technical level, no, as long as the model can output a null token. E.g. imagine training on a transcript of two people talking. What would normally be a single text token becomes a tuple of two tokens, one per person. Each stretch where a person is not talking is a series of null tokens, one per ‘tick’ of time. In an actual conversation, one token in the tuple is user input and the other is the GPT's prediction. Just disregard the user half of the tuple when determining whether the GPT should ‘speak’.
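To make the tuple idea concrete, here is a minimal sketch, assuming each speaker's tokens are already aligned to integer ticks. The `<null>` token name and the helpers `interleave` and `model_should_speak` are made up for illustration; no real tokenizer or model works exactly like this.

```python
from typing import Dict, List, Tuple

NULL = "<null>"  # hypothetical silence token; the name is an assumption

def interleave(user: Dict[int, str], model: Dict[int, str],
               n_ticks: int) -> List[Tuple[str, str]]:
    """Build one (user_token, model_token) tuple per tick.

    Each dict maps a tick index to the token spoken at that tick;
    ticks with no entry are filled with the null token.
    """
    return [(user.get(t, NULL), model.get(t, NULL)) for t in range(n_ticks)]

def model_should_speak(step: Tuple[str, str]) -> bool:
    """Ignore the user half of the tuple and check only the model half."""
    _, model_token = step
    return model_token != NULL

# The user talks for three ticks, then the model answers for two.
user_tokens = {0: "how", 1: "are", 2: "you"}
model_tokens = {3: "fine", 4: "thanks"}
stream = interleave(user_tokens, model_tokens, n_ticks=6)

# What the model actually "said": every non-null token in its half.
spoken = [step[1] for step in stream if model_should_speak(step)]
```

Note that even in this tiny example most of the model's half is null, which is the overrepresentation problem in miniature.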

The real-world challenge is threefold. First, null tokens would be massively overrepresented in training and, by extension, in outputs. Second, at a computational level, outputting a continuous stream of tokens would be absurdly expensive. Third, there is not nearly as much training data of interleaved conversations as of monologues (e.g. research papers, this comment, etc.).

2 comments

I think you should be able to do it out of the box if you just keep sending the tokens and then ask the GPT: "Is there a mistake? Respond with just 'yes' or 'no'." Why does there have to be something like a "null" token?

It might seem expensive, yes, but at least it only has to respond with one token.
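A rough sketch of that polling loop, assuming you ask after every incoming token. `toy_model` is a stand-in I made up so the example runs on its own (it just flags the typo "teh"); a real setup would call an actual LLM there instead.

```python
from typing import Callable, List, Optional

QUESTION = 'Is there a mistake? Respond with just "yes" or "no".'
SEPARATOR = "\n---\n"  # arbitrary delimiter between transcript and question

def toy_model(prompt: str) -> str:
    """Stand-in for a real LLM call (an assumption, not a real API).

    Answers "yes" if the transcript part of the prompt contains "teh".
    """
    transcript = prompt.split(SEPARATOR)[0]
    return "yes" if "teh" in transcript.split() else "no"

def poll_for_mistake(tokens: List[str],
                     ask: Callable[[str], str] = toy_model) -> Optional[int]:
    """Query the model after each token; return the index where it first says yes."""
    seen: List[str] = []
    for i, token in enumerate(tokens):
        seen.append(token)
        prompt = " ".join(seen) + SEPARATOR + QUESTION
        if ask(prompt).strip().lower() == "yes":
            return i
    return None

first_flag = poll_for_mistake(["I", "like", "teh", "cat"])
no_flag = poll_for_mistake(["all", "good", "here"])
```

Each reply is a single token, but the loop still makes one model call per incoming token, and each call re-sends the whole prefix unless the serving side caches it.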

There’s a null token because the question was about not having to ask whether there was a mistake. The model would just default to constantly producing a null token until it had a real response.
Yeah, it seems the notion of time is not really built into current systems conceptually. You could pick a fixed time constant like 0.1 seconds or 1 second, but it's clear that this is missing something more fundamental.
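Here is a small sketch of what picking a fixed time constant does to a timestamped token stream, assuming timestamps in integer milliseconds. The function names and the `<null>` token are hypothetical, and letting a later token overwrite an earlier one in the same tick is just one arbitrary way to handle collisions.

```python
from typing import List, Tuple

NULL = "<null>"  # hypothetical silence token

def quantize(events: List[Tuple[int, str]], tick_ms: int,
             duration_ms: int) -> List[str]:
    """Place timestamped tokens on a fixed grid, one slot per tick.

    Empty slots hold NULL. If two tokens land in the same tick,
    the later one overwrites the earlier one.
    """
    n_ticks = -(-duration_ms // tick_ms)  # ceiling division
    ticks = [NULL] * n_ticks
    for t_ms, token in events:
        idx = min(t_ms // tick_ms, n_ticks - 1)  # clamp events at the very end
        ticks[idx] = token
    return ticks

def null_fraction(ticks: List[str]) -> float:
    """Share of slots that carry no real token."""
    return ticks.count(NULL) / len(ticks)

events = [(0, "hi"), (300, "there")]
fine = quantize(events, tick_ms=100, duration_ms=500)     # 0.1 s ticks
coarse = quantize(events, tick_ms=1000, duration_ms=500)  # 1 s ticks
```

With 0.1 s ticks both tokens survive but most slots are null; with 1 s ticks there are no nulls but "hi" gets overwritten. Neither setting is obviously right, which is roughly the "missing something more fundamental" point.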
I think if the same LLM were trained on audio and video input instead of text, and produced audio output, including silence tokens, then the notion of time would get "built in". Audio continuation without translation to text has been shown to work. Mixing it with text is also possible. But all this would require a massive network that may be difficult even for the world's biggest companies to train and serve at any kind of scale. So it's more of an engineering problem than a theoretical one imho.

Also imho, until the context/memory problem is fully solved we won't really see the AI as having any kind of agency. But continuous, low-latency interaction would certainly feel like a step towards that.