by profile53
920 days ago
At a purely technical level, no, as long as the model can output a null token. E.g. imagine training on a transcript of two people talking. What would be a single text token becomes a tuple of two tokens, one per person. Each segment where a person is not talking is a series of null tokens, one per ‘tick’ of time. In an actual conversation, one token in the tuple is user input and one is the GPT's prediction. Just disregard the user half of the tuple when determining whether the GPT should ‘speak’. The real-world challenge is threefold. First, null tokens would be massively overrepresented in training and, by extension, in outputs. Second, at a computational level, outputting a continuous stream of tokens would be absurdly expensive. Third, there is not nearly as much training data of interspersed conversations as of monologues (e.g. research papers, this comment, etc.).
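To make the tuple idea concrete, here is a rough sketch of the per-tick loop. Everything in it is an assumption for illustration: the `NULL` marker, the `run_duplex` helper, and especially `toy_model`, which is a hand-written rule standing in for a trained model. It only shows the bookkeeping, i.e. the model predicts a full (user, model) tuple each tick, the predicted user half is thrown away, and the real user token is recorded instead.

```python
from typing import Callable, List, Tuple

NULL = "<null>"  # hypothetical silence token, emitted once per tick

Tick = Tuple[str, str]  # (user_token, model_token)


def run_duplex(
    user_stream: List[str],
    model_step: Callable[[List[Tick]], Tick],
) -> List[Tick]:
    """Advance one tick per user token and return the full transcript."""
    history: List[Tick] = []
    for user_tok in user_stream:
        # The model predicts both halves of the next tuple.
        _predicted_user, model_tok = model_step(history)
        # Disregard the predicted user half; record the real user input.
        history.append((user_tok, model_tok))
    return history


def toy_model(history: List[Tick]) -> Tick:
    """Stand-in for a trained model: says "ok" one tick after the user goes quiet."""
    user_just_stopped = (
        len(history) >= 2
        and history[-1][0] == NULL
        and history[-2][0] != NULL
    )
    return (NULL, "ok") if user_just_stopped else (NULL, NULL)


user_stream = ["hi", "there", NULL, NULL, NULL]
transcript = run_duplex(user_stream, toy_model)

# What the model actually "said", with silence filtered out.
spoken = [m for _, m in transcript if m != NULL]
# Share of the model's output that is just silence.
null_fraction = sum(m == NULL for _, m in transcript) / len(transcript)
```

Even in this tiny example most of the model's ticks are null tokens, which is the overrepresentation problem, and the model is invoked on every tick whether or not it speaks, which is the cost problem.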
Yes, it might seem expensive, but at least it only has to respond with one token.