by jmchambers 613 days ago
I _think_ I understand the basic premise behind stable diffusion, i.e., reverse the denoising process to generate realistic images but, as far as I know, this is always done at the pixel level. Is there any research attempting to do this at the 3D asset level, i.e., subbing in game engine assets (with position and orientation) until a plausible scene is recreated? If it were possible to do it that way, couldn't it "dream" up real maps, with real physics, and so avoid the somewhat noisy output these types of demo generate?
6 comments

I think the closest we have right now is 3D gaussian splatting.

So far it's only been used to train a scene from photographs from multiple angles and rebuild it volumetrically by adjusting densities in a point-cloud.

But it might be possible to train a model on multiple different scenes, and perform diffusion on a random point cloud to generate new scenes.

Rendering a point cloud in real time is also very efficient, so it could be used to create insanely realistic game worlds instead of polygonal geometry.

It seems someone already thought of that: https://ar5iv.labs.arxiv.org/html/2311.11221

Interesting, I guess that takes things even further and removes the need for hand-crafted 3D assets altogether, which is probably how things will end up going in gaming, long-term.

I was suggesting a more modest approach, I guess, one where the reverse-denoising process involves picking and placing existing 3D assets, e.g., those in GTA 5, so that the process is actually building a plausible map, using those 3D assets, but on the fly...

Turn your car right and a plausible street decorated with buildings, trees and people is dreamt up by the algorithm. All the lighting and physics would still be done in-engine, with stable diffusion acting as a dynamic map creator, with an inherent knowledge of how to decorate a street with a plausible mix of assets.

I suppose it could form the basis of a procedurally generated game world where, given the same random seed, it could generate whole cities or landscapes that would be the same on each player's machine. Just an idea...
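To make the "same seed, same city on every machine" part concrete, here is a minimal sketch. It uses a plain seeded random generator rather than a diffusion model, and the asset names and the `generate_block` helper are made up for illustration:

```python
import random

# Hypothetical asset names, purely for illustration.
ASSETS = ["house", "shop", "tree", "bench", "lamp_post"]

def generate_block(seed: int, block_x: int, block_y: int, slots: int = 6) -> list[str]:
    """Deterministically pick assets for one city block."""
    # Seed a private RNG from the world seed plus the block coordinates,
    # so each block can be generated on demand, in any order.
    rng = random.Random(f"{seed}:{block_x}:{block_y}")
    return [rng.choice(ASSETS) for _ in range(slots)]

a = generate_block(42, 3, -1)
b = generate_block(42, 3, -1)
print(a == b)  # True
```

A learned generator would replace `rng.choice` with sampling from the model, but the seeding idea stays the same as long as the sampler is deterministic for a given seed.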

The thing is, there are generators that can do exactly this; there's no need for an LLM as the middleman. Terrain generation, city generation, crowd control and character generation can all be done quite easily with far less compute and energy.
Someone has to write those by hand, and they don't generalize.

Diffusion based generators will do everything soon. And in every style imaginable.

We'll probably solve the energy issue in time.

Technically I guess one could do a stable diffusion-like model on voxels, where instead of pixel intensity values it produces a scalar field, which you could turn into geometry using marching cubes or something similar.

Not sure how efficient that would be though, and it would probably only work for assets like teapots and whatnot, not whole game maps.
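A rough sketch of the scalar-field idea, assuming the model's output is a signed value per voxel (negative inside, positive outside). The field here is a hand-written sphere rather than anything generated, and the code only finds the cells that straddle the surface; a full marching cubes implementation would then emit triangles for each of those cells:

```python
import math

def sphere_field(n: int, radius: float) -> list[list[list[float]]]:
    """Signed distance to a sphere centred in an n*n*n voxel grid."""
    c = (n - 1) / 2
    return [[[math.dist((x, y, z), (c, c, c)) - radius
              for z in range(n)] for y in range(n)] for x in range(n)]

def surface_cells(field, iso: float = 0.0):
    """Return the cells whose 8 corners straddle the iso level.

    These are the cells marching cubes would triangulate; the
    triangulation lookup table itself is omitted here.
    """
    n = len(field)
    cells = []
    for x in range(n - 1):
        for y in range(n - 1):
            for z in range(n - 1):
                corners = [field[x + dx][y + dy][z + dz]
                           for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)]
                if min(corners) < iso <= max(corners):
                    cells.append((x, y, z))
    return cells

field = sphere_field(16, radius=5.0)
cells = surface_cells(field)
print(len(cells), "of", 15 ** 3, "cells contain surface")
```

Only a thin shell of cells carries geometry, which is part of why dense voxel grids get expensive for anything the size of a map.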

That's a simplified version of what a point cloud stores, but only works with cubes then.

A point cloud is basically a 3D texture of colors and densities, so a raymarching algorithm can traverse it adding densities it collides with to find the final fragment color. That's how realistic fog and clouds are rendered in games nowadays, and it's very fast, except they use a noise function instead of a scene model.
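The raymarching part can be sketched like this. It is a minimal version with a made-up analytic `density` function standing in for the 3D texture, and standard front-to-back compositing; a real renderer would do this per fragment on the GPU:

```python
import math

def density(p):
    """Hypothetical volume: a soft ball of fog of radius 1 at the origin."""
    return max(0.0, 1.0 - math.dist(p, (0.0, 0.0, 0.0)))

def raymarch(origin, direction, steps=64, step_size=0.05, fog_color=(1.0, 1.0, 1.0)):
    """Front-to-back compositing along one ray.

    Returns (rgb, transmittance), where transmittance is the fraction
    of background light that still gets through the volume.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for i in range(steps):
        p = tuple(o + d * step_size * i for o, d in zip(origin, direction))
        # Opacity of this short segment (Beer-Lambert absorption).
        alpha = 1.0 - math.exp(-density(p) * step_size)
        for c in range(3):
            color[c] += transmittance * alpha * fog_color[c]
        transmittance *= 1.0 - alpha
        if transmittance < 0.01:  # early exit once nearly opaque
            break
    return tuple(color), transmittance

hit_rgb, hit_t = raymarch((0.0, 0.0, -1.6), (0.0, 0.0, 1.0))    # through the fog
miss_rgb, miss_t = raymarch((5.0, 0.0, -1.6), (0.0, 0.0, 1.0))  # misses it entirely
print(hit_t, miss_t)
```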

> A point cloud is basically a 3D texture of colors and densities

That's not how I'm familiar with it. As I know it[1], a point cloud is literally that: a collection of individual points that represents an object or scene.

What you describe is more like the scalar field[2] I mentioned, where each position in space has some value. You can render those directly like you say; I was thinking that, to extract geometry, a level-set method could be interesting.

[1]: https://en.wikipedia.org/wiki/Point_cloud

[2]: https://en.wikipedia.org/wiki/Scalar_field
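To illustrate the distinction, a small sketch: the point cloud is just a list of positions, while the scalar field has a value at every grid position, empty space included. The `voxelize` helper is hypothetical and simply counts points per cell of a grid over the unit cube:

```python
# A point cloud: an unordered list of samples, each with its own position.
points = [(0.1, 0.2, 0.9), (0.15, 0.22, 0.88), (0.8, 0.8, 0.1)]

def voxelize(points, n):
    """Count points per cell of an n*n*n grid over the unit cube.

    The result is a scalar field: every grid position holds a value,
    even where there are no points at all.
    """
    grid = [[[0 for _ in range(n)] for _ in range(n)] for _ in range(n)]
    for x, y, z in points:
        i, j, k = (min(int(c * n), n - 1) for c in (x, y, z))
        grid[i][j][k] += 1
    return grid

grid = voxelize(points, n=4)
print(len(points), "points vs", 4 ** 3, "grid cells")  # 3 points vs 64 grid cells
print(grid[0][0][3], grid[3][3][0])                     # 2 1
```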

> but, as far as I know, this is always done at the pixel level

Image models are NOT denoised at the pixel level - diffusion happens in latent space. This was one of the big breakthroughs that made all of this work well.

There's a model for encoding/decoding between pixels and latent space. Latent space is able to encode whatever concepts it needs in whichever of its dimensions it needs, and is generally lower dimensional than pixel space. So we start with a noisy latent, denoise it using the diffusion model, then use the other model (a variational autoencoder) to decode it into pixel space.
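The control flow, very roughly, looks like the sketch below. Everything here is a toy stand-in: `toy_denoiser` and `toy_decoder` are hypothetical placeholders for the trained diffusion model and the VAE decoder, and the update rule is a simplification rather than a real sampler. The only point is the order of operations: noise, repeated denoising in latent space, then a single decode to pixels.

```python
import random

LATENT_SIZE = 4   # toy latent: 4 numbers
SCALE = 8         # toy decoder expands each latent value into 8 "pixels"

def toy_denoiser(latent, step, total_steps):
    """Stand-in for the trained diffusion model: predicts the noise to remove.

    A real model is conditioned on the step; this toy ignores it and treats
    everything away from a fixed target as noise.
    """
    target = [0.5, -0.5, 0.25, 0.0]  # hypothetical "clean" latent
    return [z - t for z, t in zip(latent, target)]

def toy_decoder(latent):
    """Stand-in for the VAE decoder: latent -> pixels (nearest-neighbour upsample)."""
    return [z for z in latent for _ in range(SCALE)]

def generate(seed, steps=20):
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in range(LATENT_SIZE)]  # start from pure noise
    for step in range(steps):
        predicted_noise = toy_denoiser(latent, step, steps)
        # Remove a fraction of the predicted noise at each step.
        latent = [z - 0.3 * n for z, n in zip(latent, predicted_noise)]
    return toy_decoder(latent)  # pixels are produced only once, at the end

pixels = generate(seed=0)
print(len(pixels))  # 32
```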

Not exactly 3D assets, but diffusion models are used to generate e.g. traffic (vehicle trajectories) for evaluating autonomous vehicle algorithms. These vehicles tend to crash quite a lot.

For example https://github.com/NVlabs/CTG

Edit: fixed link

Generating this at pixel level is the next level thing. The reverse engineering method you described is probably appealing because it's easier to understand.

Focusing on pixel level generation is the right approach I think. The somewhat noisy output will probably be improved upon in a short timeframe. Now that they've proved with Doom (https://gamengen.github.io/) and this that it's possible, more research is probably happening right now to nail the correct architecture to scale this to HD with minimal hallucination. It happened with videos already, so we should see a similar level of breakthrough soon.

> I _think_ I understand the basic premise behind stable diffusion, i.e., reverse the denoising process to generate realistic images but, as far as I know, this is always done at the pixel level.

It's typically not done at the pixel level, but at the "latent space" level of e.g. a VAE. The image generation is done in this space, which has fewer outputs than the pixels of the final image, and then converted to the pixels using the VAE.
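For a sense of scale, assuming the commonly cited Stable Diffusion v1 shapes (a 512x512 RGB image and a 64x64 latent with 4 channels; other models differ):

```python
pixel_values = 512 * 512 * 3   # values in the final RGB image
latent_values = 64 * 64 * 4    # values the diffusion model actually works on

print(pixel_values)                   # 786432
print(latent_values)                  # 16384
print(pixel_values // latent_values)  # 48
```

So under those assumptions the denoiser handles roughly 48 times fewer values per step than it would in pixel space.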

Frantically Googles VAE...

Ah, okay, so the work is done at a different level of abstraction, didn't know that. But I guess it's still a pixel-related abstraction, and it is converted back to pixels to generate the final image?

I suppose in my proposed (and probably implausible) algorithm, that different level of abstraction might be loosely analogous to collections of related game engine assets that are often used together, so that the denoising algorithm might be effectively saying things like "we'll put some building-related assets here-ish, and some park-related flora assets over here...", and then that gets crystallised into actual placement of individual assets in the post-processing step.

(High level, specifics are definitely wrong here)

The VAE isn't really pixel-level, it's semantic-level. The most significant bits in the encoding are like "how light or dark is the image" and then towards the other end bits represent more niche things like "if it's an image of a person, make them wear glasses". This is way more efficient than using raw pixels because it's so heavily compressed, there's less data. This was one of the big breakthroughs of stable diffusion compared to previous efforts like disco diffusion that work on the pixel level.

The VAE encodes and decodes images automatically. It's not something that's written, it's trained to understand the semantics of the images in the same way other neural nets are.

Stable diffusion is in latent space, not by pixel.