by notahacker
1045 days ago
> Who knows what will happen if we plug it into a body with senses, limbs, and reproductive capabilities
I would imagine that its layers will be far too occupied by parsing constant flows of sensory information to transform corpuses of text and prompt into speedy and polite text replies, never mind acquire the urge to reproduce by reasoning from first principles about the text. Test's quite unfair the other way round too. Most humans don't get to parse the entire canon of Western thought and Reddit before being asked to pattern match human conversation, never mind before having any semblance of agency... Maybe we're just... different.
If I were building this, I would have parallel background "subconscious" processes translate raw sensory inputs into text (tokens).
This is what OpenAI call multi-modal input. They've already produced Whisper for audio-to-text, and image-to-text is underway. They're not the only company working on this.
You wouldn't feed a constant stream of text data into the LLM - you'd feed deltas at regular intervals based on the processing speed of your LLM, and supply history for context.
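A rough sketch of what "deltas plus history" could look like, purely as an illustration. The names `compute_delta` and `UpdateFeed` are made up for this comment, the sensory state is assumed to be a flat dict of text descriptions, and removed keys are not handled:

```python
from collections import deque


def compute_delta(previous, current):
    """Return only the entries whose value changed since the last tick."""
    return {k: v for k, v in current.items() if previous.get(k) != v}


class UpdateFeed:
    """Hypothetical helper: turns full sensory snapshots into delta + recent history."""

    def __init__(self, history_size=3):
        self.previous = {}
        # Bounded history so the context sent to the LLM stays small.
        self.history = deque(maxlen=history_size)

    def tick(self, current):
        delta = compute_delta(self.previous, current)
        self.previous = dict(current)
        context = list(self.history)  # history as it was before this tick
        if delta:
            self.history.append(delta)
        return {"history": context, "delta": delta}


feed = UpdateFeed(history_size=2)
first = feed.tick({"audio": "Bob is speaking", "vision": "a kitchen"})
second = feed.tick({"audio": "Bob is speaking", "vision": "a kitchen, door open"})
```

On the second tick only the `vision` entry is sent as the delta, with the first tick supplied as history.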
Note that LLMs don't need to "wait" for a complete input. For example, if an LLM takes 1 second to process requests, we should aim to feed updates from the "subconscious" to the "conscious" within 1 second.
So if somebody is speaking a 10-second long sentence, we don't wait 10 seconds to send the sentence to the LLM. After 1 second, we send the following to the LLM: "Bob, still speaking, has said 'How much wood...'". After 2 seconds we send "Bob, still speaking, has said 'How much wood could a woodchuck...'", etc. The LLM can be instructed not to respond to Bob until it has a meaningful response or interjection to the current input.
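To make the Bob example concrete, here is a minimal sketch that assumes the audio-to-text stage hands us words with end timestamps. The function name `partial_updates`, the `(end_time, word)` input shape, and the wording of the final "finished speaking" message are all assumptions for illustration:

```python
def partial_updates(speaker, timed_words, duration, interval=1.0):
    """Build the messages sent to the LLM while someone is still talking.

    timed_words: list of (end_time_seconds, word) pairs (assumed input shape).
    duration: total length of the utterance in seconds.
    """
    updates = []
    tick = 1
    # One partial update per interval, using only words fully heard so far.
    while tick * interval < duration:
        now = tick * interval
        heard = " ".join(word for end, word in timed_words if end <= now)
        if heard:
            updates.append(f"{speaker}, still speaking, has said '{heard}...'")
        tick += 1
    # Final message once the utterance is complete (wording is an assumption).
    full = " ".join(word for _, word in timed_words)
    updates.append(f"{speaker} has finished speaking and said '{full}'")
    return updates


words = [
    (0.3, "How"), (0.6, "much"), (0.9, "wood"),
    (1.3, "could"), (1.5, "a"), (1.9, "woodchuck"), (2.4, "chuck"),
]
updates = partial_updates("Bob", words, duration=2.5)
for message in updates:
    print(message)
```

Whether the LLM answers after any given partial message is left to its instructions, as described above.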
Similarly, if image-to-text takes 10 seconds at full resolution, we could first process a frame at a resolution that only takes 1 second, and provide that information - with the caveat it is uncertain information - to the LLM while we continue to work on a full resolution frame in the background. We can optimise by not processing previously processed scenes, by focusing only on areas of the image that have changed, etc.
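The "only look at what changed" optimisation could be sketched like this, assuming frames are small grayscale grids (lists of lists of ints) of the same size. `changed_tiles` is a hypothetical helper, and a real pipeline would presumably use an image library rather than nested lists:

```python
def changed_tiles(prev, curr, tile=2, threshold=0):
    """Return (tile_row, tile_col) indices where any pixel changed by more than threshold."""
    rows, cols = len(curr), len(curr[0])
    changed = []
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            differs = any(
                abs(curr[r][c] - prev[r][c]) > threshold
                for r in range(r0, min(r0 + tile, rows))
                for c in range(c0, min(c0 + tile, cols))
            )
            if differs:
                changed.append((r0 // tile, c0 // tile))
    return changed


prev_frame = [[0] * 4 for _ in range(4)]
curr_frame = [row[:] for row in prev_frame]
curr_frame[3][2] = 9  # one pixel changes in the bottom-right tile

tiles = changed_tiles(prev_frame, curr_frame)
```

Only the tiles returned here would need to go back through image-to-text; everything else can reuse the previous description.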
Would it be slow? Yes, just like "conscious" processing is for humans. Okay, so today it would be much slower than humans process their sensory input, but in 10 years? 20?
As for how to represent an urge to reproduce within this paradigm - I'll leave that as an exercise for the reader.