| > Now the LLM can choose to switch, at its own discretion, back and forth between a talking and listening mode How would it intelligently do this? What data would you train on? You don't have trillions words of text where humans wrote what they thought silently interwoven with what they wrote publicly. History has shown over and over that hard coded ad hoc solutions to these "simple problems" never work to create intelligent agents, you need to train the model to do that from the start you can't patch in intelligence after the fact. Those additions can be useful, but they have never been intelligent. Anyway, such a model I'd call "stream of mind model" rather than a language model, it would fundamentally solve many of the problems with current LLM where their thinking is reliant on the shape of the answer, while a stream of mind model would shape its thinking to fit the problem and then shape the formatting to fit the communication needs. Such a model as this guy describes would be a massive step forward, so I agree with this, but it is way too expensive to train, not due to lack of compute but due to lack of data. And I don't see that data being done within the next decade if ever, humans don't really like writing down their hidden thoughts, and you'd need to pay them to generate data amounts equivalent to the internet... |
It's a fair question, and I don't have all the answers. But for this question, there might be training data available from everyday human conversations. For example, we could use a speech-to-text model that's able to distinguish speakers, and look for points where one person decided to start speaking (that would be training data for when to switch modes). Ideally, the speech-to-text model would be able to include text even when both people spoke at once (this would provide more realistic and complete training data).
I've noticed that the audio mode in ChatGPT's app is good at noticing when I'm done speaking to it, and it reacts accurately enough that I suspect it's more sophisticated than "wait for silence." If there is a "notice the end of speaking" model - which is not a crazy assumption - then I can imagine a slightly more complicated model that notices a combination of "now is a good time to talk + I have something to say."