| Not sure I follow. If I were building this, I would have parallel background "subconscious" processes translate raw sensory inputs into text (tokens). This is what OpenAI call multi-modal input. They've already produced Whisper for audio-to-text, and image-to-text is underway. They're not the only company working on this. You wouldn't feed a constant stream of text data into the LLM - you'd feed deltas at regular intervals based on the processing speed of your LLM, and supply history for context. Note that LLMs don't need to "wait" for a complete input. For example, if an LLM takes 1 second to process a request, we should aim to feed updates from the "subconscious" to the "conscious" within 1 second. So if somebody is speaking a 10-second-long sentence, we don't wait 10 seconds to send the sentence to the LLM. After 1 second, we send the following to the LLM: "Bob, still speaking, has said 'How much wood...'". After 2 seconds we send "Bob, still speaking, has said 'How much wood could a woodchuck...'", etc. The LLM can be instructed not to respond to Bob until it has a meaningful response or interjection to the current input. Similarly, if image-to-text takes 10 seconds at full resolution, we could first process a frame at a resolution that only takes 1 second, and provide that information - with the caveat that it is uncertain information - to the LLM while we continue to work on a full-resolution frame in the background. We can optimise by not reprocessing scenes we have already processed, by focusing only on areas of the image that have changed, etc. Would it be slow? Yes, just like "conscious" processing is for humans. Okay, so today it would be much slower than humans process their sensory input, but in 10 years? 20? As for how to represent an urge to reproduce within this paradigm - I'll leave that as an exercise for the reader. |
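For concreteness, the "feed deltas at regular intervals" idea in the quoted comment could be sketched roughly like this. This is only an illustration under assumptions: `SpeechStream`, `build_update`, `run` and the three-words-per-tick rate are hypothetical names and numbers, not any real speech-to-text or LLM API, and the actual LLM call is left as a comment.

```python
from dataclasses import dataclass


@dataclass
class SpeechStream:
    """Hypothetical stand-in for a 'subconscious' speech-to-text process."""
    speaker: str
    words: list[str]         # words in the order they would be recognised
    words_per_tick: int = 3  # assumed recognition rate per LLM cycle

    def transcript_at(self, tick: int) -> tuple[str, bool]:
        """Return (text heard so far, whether the speaker has finished)."""
        heard = self.words[: tick * self.words_per_tick]
        return " ".join(heard), len(heard) == len(self.words)


def build_update(stream: SpeechStream, tick: int) -> tuple[str, bool]:
    """Turn the partial transcript into one text update for the LLM."""
    text, done = stream.transcript_at(tick)
    if done:
        return f"{stream.speaker} has finished speaking and said '{text}'", True
    return f"{stream.speaker}, still speaking, has said '{text}...'", False


def run(stream: SpeechStream, max_ticks: int) -> list[str]:
    """One update per tick (one tick = one assumed LLM processing cycle)."""
    history: list[str] = []
    for tick in range(1, max_ticks + 1):
        update, done = build_update(stream, tick)
        history.append(update)
        # A real system would call the LLM here, passing `history` as
        # context and letting it decide whether to respond yet.
        if done:
            break
    return history


if __name__ == "__main__":
    bob = SpeechStream("Bob", "How much wood could a woodchuck chuck".split())
    for line in run(bob, max_ticks=10):
        print(line)
```

Each tick sends the whole partial sentence again rather than only the new words, which matches the example messages in the comment; sending only the newly recognised words plus history would be the other obvious design.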
In a discussion about whether LLMs could have agency and generalised reasoning ability, you suggested the comparison was unfair because they hadn't received all the i/o a typical human does.
I pointed out that LLMs wouldn't be able to reason about that i/o (and if we made it fully fair and trained them only on the comparatively small subset of text and discernible words humans learn from, they'd probably lose their facility with language too).
I don't disagree that bolting LLMs to other highly trained models, like a program for controlling a robot arm, and training them intensively can yield useful results, arguably much more useful results than building a digital facsimile of a human toddler (toddlers produce pretty useless outputs but also have stuff going on internally that we can barely begin to replicate in silicon). But that isn't exposing an LLM to equivalents of human sensory input to get back an autonomous agent with generalised reasoning capacity. It's manually bolting discrete specialised programs to an LLM-as-message-passing-layer to get a machine which, given more training, is capable of a slightly broader range of specialised tasks.