by AnotherGoodName 384 days ago
Yes, the thing that got me was that I went through the channels multiple times (multiple browser sessions). The channels are the same every time (the numbers don't correspond to any navigation order, though: flip back and forth between two numbers and you'll just hit a random channel every time, so don't be fooled by that). Every object is in the same position and the layout is the same.

What makes this AI generated over just rendering a generated 3D scene?

It may seem impressive to have no glitches (often in AI-generated works you can turn a full rotation and what's in front of you isn't what was there originally), but here it just acts like a fully modelled 3D scene rendered at low resolution. I can't even walk outside of certain bounds, which doesn't make sense if this really is generated on the fly.

This needs a lot of skepticism, and I'm surprised you're the first to comment on the lack of actual generation here. It's a series of static scenes rendered at low fidelity with limited bounds.


OK, playing with this more, there are very subtle differences between sessions, so there is some hallucination here.

I think what's happening is that this is AI generated, but it is very heavily overfitted to real-world 3D scenes. The AI is rendering a real-world scene almost exactly and not much more. They can't allow travel out of bounds, or the model stops working, since it's so overfitted to these scenes. The overfitting solves hallucinations, but it also makes the result almost indistinguishable from pre-modelled 3D scenes.

I think the most likely explanation is that they trained a diffusion world model (WM), like DIAMOND, on video rollouts recorded from within a 3D scene representation (like NeRF or Gaussian splatting), with some collision detection enabled.

This would explain:

1. How collisions / teleportation work and why they're so rigid (the WM is mimicking hand-implemented scene-bounds logic)

2. Why the scenes are static and, in the case of should-be-dynamic elements like water/people/candles, blurred (the WM is mimicking artifacts from the 3D representation)

3. Why they are confident that "There's no map or explicit 3D representation in the outputs. This is a diffusion model, and video in/out" https://x.com/olivercameron/status/1927852361579647398 (the final product is indeed a diffusion WM trained on videos, they just have a complicated pipeline for getting those training videos)
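
To make the hypothesized pipeline concrete, here is a minimal sketch of how action-labelled rollouts with hand-implemented bounds logic might be recorded. Everything in it is an assumption for illustration: the bounds, the action set, the step size, and the names `apply_action` and `record_rollout` are hypothetical, and nothing here is confirmed to be what Odyssey does. The rendering step is left as a comment because it depends on the scene representation.

```python
import random
from dataclasses import dataclass

# Hypothetical walkable area (x and y ranges); real bounds, if any, are unknown.
BOUNDS = ((-5.0, 5.0), (-5.0, 5.0))
# Hypothetical action set: each action moves the camera by a fixed offset.
ACTIONS = {
    "forward": (0.0, 0.5),
    "back": (0.0, -0.5),
    "left": (-0.5, 0.0),
    "right": (0.5, 0.0),
}


@dataclass
class Step:
    action: str
    position: tuple  # camera (x, y) after the move
    blocked: bool    # True if the bounds check clamped the move


def clamp(value, low, high):
    return max(low, min(high, value))


def apply_action(position, action):
    """Move the camera, clamping to BOUNDS (the hand-written collision rule)."""
    dx, dy = ACTIONS[action]
    target = (position[0] + dx, position[1] + dy)
    clamped = (clamp(target[0], *BOUNDS[0]), clamp(target[1], *BOUNDS[1]))
    return clamped, clamped != target


def record_rollout(num_steps, seed=0):
    """Record a random walk of (action, pose) pairs inside the bounds."""
    rng = random.Random(seed)
    position = (0.0, 0.0)
    steps = []
    for _ in range(num_steps):
        action = rng.choice(sorted(ACTIONS))
        position, blocked = apply_action(position, action)
        # A real pipeline would render a frame from the NeRF/GS scene at
        # this pose and store it alongside the action as training data.
        steps.append(Step(action, position, blocked))
    return steps
```

A world model trained on such data would only ever see the camera stop at the boundary, which would be consistent with the rigid "stuck" behavior described in point 1.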

Odyssey Systems is six months behind way more impressive demos. They're following in the footsteps of this work:

- Open Source Diamond WM that you can run on consumer hardware [1]

- Google's Genie 2 (way better than this) [2]

- Oasis [3]

[1] https://diamond-wm.github.io/

[2] https://deepmind.google/discover/blog/genie-2-a-large-scale-...

[3] https://oasis.decart.ai/welcome

There are a lot of papers and demos in this space. They have the same artifacts.

All of this is really great work, and I'm excited to see great labs pushing this research forward.

From our perspective, what separates our work is two things:

1. Anyone can experience our model today, in real time at 30 FPS.

2. Our data domain is real-world, meaning learning life-like pixels and actions. This is, from our perspective, more complex than learning from a video game.

Is it possible that this behavior is a result of training on Google Maps or something similar? I tried to walk off a bridge and got completely stuck. That is the only explanation I can think of, other than the training data not including first-person video of people walking off bridges.