

by notahacker 1092 days ago

> Who knows what will happen if we plug it into a body with senses, limbs, and reproductive capabilities

I would imagine that its layers will be far too occupied by parsing constant flows of sensory information to transform corpuses of text and prompt into speedy and polite text replies, never mind acquire the urge to reproduce by reasoning from first principles about the text.

Test's quite unfair the other way round too. Most humans don't get to parse the entire canon of Western thought and Reddit before being asked to pattern match human conversation, never mind before having any semblance of agency...

Maybe we're just... different.


ryanjshaw 1092 days ago

Not sure I follow.

If I were building this, I would have parallel background "subconscious" processes translate raw sensory inputs into text (tokens).

This is what OpenAI call multi-modal input. They've already produced Whisper for audio-to-text, and image-to-text is underway. They're not the only company working on this.

You wouldn't feed a constant stream of text data into the LLM - you'd feed deltas at regular intervals based on the processing speed of your LLM, and supply history for context.

Note that LLMs don't need to "wait" for a complete input. For example, if an LLM takes 1 second to process requests, we should aim to feed updates from the "subconscious" to the "conscious" within 1 second.

So if somebody is speaking a 10-second-long sentence, we don't wait 10 seconds to send the sentence to the LLM. After 1 second, we send the following to the LLM: "Bob, still speaking, has said 'How much wood...'". After 2 seconds we send "Bob, still speaking, has said 'How much wood could a woodchuck...'", etc. The LLM can be instructed not to respond to Bob until it has a meaningful response or interjection to the current input.
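
A minimal sketch of that update loop, assuming the "subconscious" speech-to-text stage hands over words with end timestamps. The names `Word` and `partial_updates` are hypothetical, made up for illustration, and nothing here calls a real speech or LLM API:

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    end_time: float  # seconds since the speaker started (assumed to come from speech-to-text)


def partial_updates(speaker, words, total_duration, interval=1.0):
    """Build one text update per interval, describing what has been heard so far."""
    updates = []
    tick = 1
    # One update per elapsed interval while the speaker is still talking.
    while tick * interval < total_duration:
        now = tick * interval
        heard = " ".join(w.text for w in words if w.end_time <= now)
        updates.append(f"{speaker}, still speaking, has said '{heard}...'")
        tick += 1
    # Final update once the whole utterance is available.
    full = " ".join(w.text for w in words)
    updates.append(f"{speaker} has finished speaking and said '{full}'")
    return updates
```

Each string would be sent to the LLM along with recent history, with the prompt telling it to stay quiet until it has something meaningful to say.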

Similarly, if image-to-text takes 10 seconds at full resolution, we could first process a frame at a resolution that only takes 1 second, and provide that information - with the caveat it is uncertain information - to the LLM while we continue to work on a full resolution frame in the background. We can optimise by not processing previously processed scenes, by focusing only on areas of the image that have changed, etc.
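
A rough sketch of the two optimisations, assuming frames are plain 2D lists of grayscale values. `downsample` and `changed_tiles` are hypothetical helpers for illustration, and the actual image-to-text model is left out entirely:

```python
def downsample(frame, factor):
    """Average non-overlapping factor x factor blocks to get a cheap low-res frame."""
    h, w = len(frame), len(frame[0])
    return [
        [
            sum(frame[y + dy][x + dx] for dy in range(factor) for dx in range(factor))
            / (factor * factor)
            for x in range(0, w - w % factor, factor)
        ]
        for y in range(0, h - h % factor, factor)
    ]


def changed_tiles(prev, curr, tile, threshold):
    """Return (row, col) indices of tiles whose mean absolute difference exceeds threshold."""
    h, w = len(curr), len(curr[0])
    changed = []
    for ty in range(0, h, tile):
        for tx in range(0, w, tile):
            diffs = [
                abs(curr[y][x] - prev[y][x])
                for y in range(ty, min(ty + tile, h))
                for x in range(tx, min(tx + tile, w))
            ]
            if sum(diffs) / len(diffs) > threshold:
                changed.append((ty // tile, tx // tile))
    return changed
```

The low-res frame goes to the fast pass and gets reported as uncertain, while only the tiles returned by `changed_tiles` need the slow full-resolution pass.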

Would it be slow? Yes, just like "conscious" processing is for humans. Okay, so today it would be much slower than humans process their sensory input, but in 10 years? 20?

As for how to represent an urge to reproduce within this paradigm - I'll leave that as an exercise for the reader.


notahacker 1090 days ago

Not sure I follow the reply really.

In a discussion about whether LLMs could have agency and generalised reasoning ability, you suggested it was unfair because they hadn't received all the i/o a typical human did.

I pointed out that LLMs wouldn't be able to reason about that i/o (and if we made it fully fair and trained them on the comparatively small subset of text and discernible words humans learn from, they'd probably lose their facility with language too).

I don't disagree that bolting LLMs to other highly trained models, like a program for controlling a robot arm, and intensively training the result can yield useful results, arguably much more useful than building a digital facsimile of a human toddler (toddlers produce pretty useless outputs but also have stuff going on internally we can barely begin to adequately replicate in silicon). But that isn't exposing an LLM to equivalents of human sensory input to get back an autonomous agent with generalised reasoning capacity. That's manually bolting discrete specialised programs onto an LLM-as-message-passing-layer to get a machine which, given more training, is capable of a slightly broader range of specialised tasks.


YeGoblynQueenne 1092 days ago

>> They've already produced Whisper for audio-to-text, and image-to-text is underway.

Two modalities down. Another couple hundred to go.

Unfortunately we're fast running out of the modalities that neural nets have shown capability in (image, text, sound... I think that's it).
