| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by wkcheng 701 days ago
	It's insane that that this works, and that it works fast enough to render at 20 fps. It seems like they almost made a cross between a diffusion model and an RNN, since they had to encode the previous frames and actions and feed it into the model at each step. Abstractly, it's like the model is dreaming of a game that it played a lot of, and real time inputs just change the state of the dream. It makes me wonder if humans are just next moment prediction machines, with just a little bit more memory built in.

10 comments

lokimedes 701 days ago

It makes good sense for humans to have this ability. If we flip the argument, and see the next frame as a hypothesis for what is expected as the outcome of the current frame, then comparing this "hypothesis" with what is sensed makes it easier to process the differences, rather than the totality of the sensory input.

As Richard Dawkins recently put it in a podcast[1], our genes are great prediction machines, as their continued survival rests on it. Being able to generate a visual prediction fits perfectly with the amount of resources we dedicate to sight.

If that is the case, what does aphantasia tell us?

[1] https://podcasts.apple.com/dk/podcast/into-the-impossible-wi...

dbspin 701 days ago

Worth noting that aphantasia doesn't necessarily extend to dreams. Anecdotally - I have pretty severe aphantasia (I can conjure milisecond glimpses of barely tangible imagery that I can't quite perceive before it's gone - but only since learning that visualisation wasn't a linguistic metaphor). I can't really simulate object rotation. I can't really 'picture' how things will look before they're drawn / built etc. However I often have highly vivid dream imagery. I also have excellent recognition of faces and places (e.g.: can't get lost in a new city). So there clearly is a lot of preconscious visualisation and image matching going on in some aphantasia cases, even where the explicit visual screen is all but absent.

lokimedes 701 days ago

I fabulate about this in another comment below:

> Many people with aphantasia reports being able to visualize in their dreams, meaning that they don't lack the ability to generate visuals. So it may be that the [aphantasia] brain has an affinity to rely on the abstract representation when "thinking", while dreaming still uses the "stable diffusion mode".

(I obviously don't know what I'm talking about, just a fellow aphant)

dbspin 701 days ago

Obviously we're all introspecting here - but my guess is that there's some kind of cross talk in aphantasic brains between the conscious narrating semantic brain and the visual module. Such that default mode visualisation is impaired. It's specifically the loss of reflexive consciousness that allows visuals to emerge. Not sure if this is related, but I have pretty severe chronic insomnia, and I often wonder if this in part relates to the inability to drift off into imagery.

drowsspa 701 days ago

Yeah. In my head it's like I'm manipulating SVG paths instead of raw pixels

zimpenfish 701 days ago

Pretty much the same for me. My aphantasia is total (no images at all) but still ludicrously vivid dreams and not too bad at recognising people and places.

jonplackett 701 days ago

What’s the aphantasia link? I’ve got aphantasia. I’m convinced though that the bit of my brain that should be making images is used for letting me ‘see’ how things are connected together very easily in my head. Also I still love games like Pictionary and can somehow draw things onto paper than I don’t really know what they look like in my head. It’s often a surprise when pen meets paper.

lokimedes 701 days ago

I agree, it is my own experience as well. Craig Venter In one of his books also credit this way of representing knowledge as abstractions as his strength in inventing new concepts.

The link may be that we actually see differences between “frames”, rather than the frames directly. That in itself would imply that a from of sub-visual representation is being processed by our brain. For aphantasia, it could be that we work directly on this representation instead of recalling imagery through the visual system.

Many people with aphantasia reports being able to visualize in their dreams, meaning that they don't lack the ability to generate visuals. So it may be that the brain has an affinity to rely on the abstract representation when "thinking", while dreaming still uses the "stable diffusion mode".

I’m no where near qualified to speak of this with certainty, but it seems plausible to me.

quickestpoint 701 days ago

As Richard Dawkins theorized, would be more accurate and less LLM like :)

nsbk 701 days ago

We are. At least that's what Lisa Feldman Barrett [1] thinks. It is worth listening to this Lex Fridman podcast: Counterintuitive Ideas About How the Brain Works [2], where she explains among other ideas how constant prediction is the most efficient way of running a brain as opposed to reaction. I never get tired of listening to her, she's such a great science communicator.

[1] https://en.wikipedia.org/wiki/Lisa_Feldman_Barrett

[2] https://www.youtube.com/watch?v=NbdRIVCBqNI&t=1443s

PunchTornado 701 days ago

Interesting talk about the brain, but the stuff she says about free will is not a very good argument. Basically it is sort of the argument that the ancient greeks made which brings the discussion into a point where you can take both directions.

stevenhuang 701 days ago

> It makes me wonder if humans are just next moment prediction machines, with just a little bit more memory built in.

Yup, see https://en.wikipedia.org/wiki/Predictive_coding

quickestpoint 701 days ago

Umm, that’s a theory.

mind-blight 701 days ago

So are gravity and friction. I don't know how well tested or accepted it is, but being just a theory doesn't tell you much about how true it is without more info

bangaladore 700 days ago

> It's insane that that this works, and that it works fast enough to render at 20 fps.

It is running on an entire v5 TPU (https://cloud.google.com/blog/products/ai-machine-learning/i...)

It's unclear how that compares to a high-end consumer GPU like a 3090, but they seem to have similar INT8 TFLOPS. The TPU has less memory (16 vs. 24), and I'm unsure of the other specs.

Something doesn't add up, in my opinion, though. SD usually takes (at minimum) seconds to produce a high-quality result on a 3090, so I can't comprehend how they are like 2 orders of magnitudes faster—indicating that the TPU vastly outperforms a GPU for this task. They seem to be producing low-res (320x240) images, but it still seems too fast.

Philpax 700 days ago

There's been a lot of work in optimising inference speed of SD - SD Turbo, latent consistency models, Hyper-SD, etc. It is very possible to hit these frame rates now.

dartos 701 days ago

> It makes me wonder if humans are just next moment prediction machines, with just a little bit more memory built in.

This, to me, seems extremely reductionist. Like you start with AI and work backwards until you frame all cognition as next something predictors.

It’s just the stochastic parrot argument again.

wrsh07 701 days ago

Makes me wonder when an update to the world models paper comes out where they drop in diffusion models: https://worldmodels.github.io/

Teever 701 days ago

Also recursion and nested virtualization. We can dream about dreaming and imagine different scenarios, some completely fictional or simply possible future scenarios all while doing day to day stuff.

mensetmanusman 701 days ago

Penrose (Nobel prize in physics) stipulates that quantum effects in the brain may allow a certain amount of time travel and back propagation to accomplish this.

wrsh07 701 days ago

You don't need back propagation to learn

This is an incredibly complex hypothesis that doesn't really seem justified by the evidence

richard___ 701 days ago

Did they take in the entire history as context?

slashdave 701 days ago

Image is 2D. Video is 3D. The mathematical extension is obvious. In this case, low resolution 2D (pixels), and the third dimension is just frame rate (discrete steps). So rather simple.

Sharlin 701 days ago

This is not "just" video, however. It's interactive in real time. Sure, you can say that playing is simply video with some extra parameters thrown in to encode player input, but still.

slashdave 701 days ago

It is just video. There are no external interactions.

Heck, it is far simpler than video, because the point of view and frame is fixed.

SeanAnderson 701 days ago

I think you're mistaken. The abstract says it's interactive, "We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction"

Further - "a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions." specifically "and actions"

User input is being fed into this system and subsequent frames take that into account. The user is "actually" firing a gun.

slashdave 701 days ago

No, I am not. The interaction is part of the training, and is used during inference, but it is not including during the process of generation.

SeanAnderson 701 days ago

Okay, I think you're right. My mistake. I read through the paper more closely and I found the abstract to be a bit misleading compared to the contents. Sorry.

smusamashah 701 days ago

It's interactive but can it go beyond what it learned from the videos. As in, can the camera break free and roam around the map from different angles? I don't think it will be able to do that at all. There are still a few hallucinations in this rendering, it doesn't look it understands 3d.

Sharlin 701 days ago

You might be surprised. Generating views from novel angles based on a single image is not novel, and if anything, this model has more than a single frame as input. I’d wager that it’s quite able to extrapolate DOOM-like corridors and rooms even if it hasn’t seen the exact place during training. And sure, it’s imperfect but on the other hand it works in real time on a single TPU.

hypertele-Xii 701 days ago

Then why do monsters become blurry smudgy messes when shot? That looks like a video compression artifact of a neural network attempting to replicate low-structure image (source material contains guts exploding, very un-structured visual).

Sharlin 701 days ago

Uh, maybe because monster death animations make up a small part of the training material (ie. gameplay) so the model has not learned to reproduce them very well?

There cannot be "video compression artifacts" because it hasn’t even seen any compressed video during training, as far as I can see.

Seriously, how is this even a discussion? The article is clear that the novel thing is that this is real-time frame generation conditioned on the previous frame(s) AND player actions. Just generating video would be nothing new.

nopakos 701 days ago

Maybe it's so advanced, it knows the players' next moves, so it is a video!

slashdave 701 days ago

I guess you are being sarcastic, except this is precisely what it is doing. And it's not hard: player movement is low information and probably not the hardest part of the model.

raincole 701 days ago

?

I highly suggest you to read the paper briefly before commenting on the topic. The whole point is that it's not just generating a video.

slashdave 701 days ago

I did. It is generating a video, using latent information on player actions during the process (which it also predicts). It is not interactive.

Sharlin 700 days ago

Uff, I guess you’re right. Mea culpa. I misread their diagram to represent inference when it was about training instead. The latter is conditioned on actions, but… how do they generate the actual output frames then? What’s the input? Is it just image-to-image based on the previous frame? The paper doesn’t seem to explain the inference part at all well :(

slashdave 700 days ago

It should be possible to generate an initial image from Gaussian noise, including the latent information on player position

InDubioProRubio 701 days ago

Video is also higher resolution, as the pixels flip for the high resolution world by moving through it. Swivelling your head without glasses, even the blurry world contains more information in the curve of pixelchange.

slashdave 701 days ago

Correct, for the sprites. However, the walls in Doom are texture mapped, and so have the same issue as videos. Interesting, though, because I assume the antialiasing is something approximate, given the extreme demands on CPUs of the era.