by AnotherGoodName 382 days ago
OK, playing with this more, there are very subtle differences between sessions. That is, there is some hallucination here, showing up as certain small differences.

I think what's happening is that this is AI-generated, but it is very heavily overfitted to real-world 3D scenes. The AI is rendering a real-world scene almost exactly, and not much more. They can't travel out of bounds or the model stops working, since it's so overfitted to these scenes. The overfitting solves hallucinations, but it also makes the output almost indistinguishable from pre-modelled 3D scenes.


I think the most likely explanation is that they trained a diffusion world model (WM), like DIAMOND, on video rollouts recorded from within a 3D scene representation (like NeRF/GS), with some collision detection enabled.

This would explain:

1. How collisions / teleportation work and why they're so rigid (the WM is mimicking hand-implemented scene-bounds logic)

2. Why the scenes are static and, in the case of should-be-dynamic elements like water/people/candles, blurred (the WM is mimicking artifacts from the 3D representation)

3. Why they are confident that "There's no map or explicit 3D representation in the outputs. This is a diffusion model, and video in/out" https://x.com/olivercameron/status/1927852361579647398 (the final product is indeed a diffusion WM trained on videos, they just have a complicated pipeline for getting those training videos)
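
To make the guess above concrete, here is a minimal sketch of what "video rollouts recorded inside a 3D scene with collision detection" could look like. Everything in it is hypothetical: the action set, the axis-aligned `Bounds` box, the step size, and the `record_rollout` helper are illustrative assumptions, not anything Odyssey has described. The renderer is left out entirely; a real pipeline would render a frame from the NeRF/GS scene at each pose.

```python
import random
from dataclasses import dataclass

# Hypothetical action set; the real control scheme is not known.
ACTIONS = {
    "forward": (0.0, 0.0, 1.0),
    "back": (0.0, 0.0, -1.0),
    "left": (-1.0, 0.0, 0.0),
    "right": (1.0, 0.0, 0.0),
}


@dataclass
class Bounds:
    """Axis-aligned box standing in for hand-implemented scene bounds."""
    lo: tuple
    hi: tuple

    def contains(self, p):
        return all(l <= x <= h for l, x, h in zip(self.lo, p, self.hi))


def step(pos, action, bounds, step_size=0.25):
    """Move the camera; reject any move that would leave the scene bounds."""
    direction = ACTIONS[action]
    candidate = tuple(x + step_size * d for x, d in zip(pos, direction))
    return candidate if bounds.contains(candidate) else pos


def record_rollout(bounds, start, n_steps, seed=0):
    """Random-walk the camera and log (action, pose) pairs.

    A real pipeline would also render a frame at each pose; the resulting
    (frame, action) sequence is what a diffusion WM would be trained on.
    """
    rng = random.Random(seed)
    pos = start
    rollout = []
    for _ in range(n_steps):
        action = rng.choice(sorted(ACTIONS))
        pos = step(pos, action, bounds)
        rollout.append((action, pos))
    return rollout


bounds = Bounds(lo=(-1.0, 0.0, -1.0), hi=(1.0, 0.0, 1.0))
rollout = record_rollout(bounds, start=(0.0, 0.0, 0.0), n_steps=200)
```

If the training videos were produced this way, the WM would never see a frame from outside the box, and a blocked move would always appear as "action taken, view unchanged". That would be consistent with the rigid collisions in point 1.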

Odyssey Systems is six months behind way more impressive demos. They're following in the footsteps of this work:

- Open Source Diamond WM that you can run on consumer hardware [1]

- Google's Genie 2 (way better than this) [2]

- Oasis [3]

[1] https://diamond-wm.github.io/

[2] https://deepmind.google/discover/blog/genie-2-a-large-scale-...

[3] https://oasis.decart.ai/welcome

There are a lot of papers and demos in this space. They have the same artifacts.

All of this is really great work, and I'm excited to see great labs pushing this research forward.

From our perspective, what separates our work is two things:

1. Anyone can experience our model today, in real time at 30 FPS.

2. Our data domain is the real world, meaning the model learns life-like pixels and actions. This is, from our perspective, more complex than learning from a video game.