by zaptrem 1251 days ago
You might also mess with your loss function to force it to "make up its mind." Right now the blurry mess likely minimizes the error against the actual frame, which isn't really what you want.
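To make the point about the blurry mess concrete, here is a tiny sketch. It assumes a made-up 1D, 4-pixel "frame" with two equally likely futures and a plain mean-squared-error loss; none of this comes from the actual model in question, and the names `futures`, `mse`, and `expected_mse` are only for illustration.

```python
# Two equally likely "next frames" for a 1D, 4-pixel image (made-up data):
# the ball ends up either at pixel 0 or at pixel 3.
futures = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

def mse(pred, target):
    """Mean squared error between two equal-length pixel lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def expected_mse(pred):
    """Average MSE of one fixed prediction over all possible futures."""
    return sum(mse(pred, f) for f in futures) / len(futures)

# Pixel-wise average of the futures: two half-bright "ghost" balls.
blurry = [sum(px) / len(futures) for px in zip(*futures)]

# Committing to one concrete outcome instead.
sharp = futures[0]

print("blurry frame:", blurry)
print("expected MSE, blurry:", expected_mse(blurry))
print("expected MSE, sharp: ", expected_mse(sharp))
```

Under these assumptions the averaged frame scores a lower expected error than either sharp frame, so a network trained purely on this loss is rewarded for hedging.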

Exactly! That was the exact thing I was trying to think of a way to do.

Got any ideas? There are discriminators, but after reading over prior work, it seems like they help without really being a groundbreaking or especially effective solution.

I had two harebrained ideas in mind. One is to add YOLO-style object detection. The difference between a blurry mess and a recognizable object is the fact that it’s a recognizable object, so minimizing the error wrt YOLO might work. (“If there are more recognizable objects in the ground truth image than in the generated image, penalize the network.”)
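A rough sketch of what that penalty could look like, assuming some detector has already been run on both frames and returned a list of confidence scores per frame. The function names and the 0.5 threshold are hypothetical, and no real YOLO API is used here.

```python
def detection_count_penalty(truth_scores, generated_scores, threshold=0.5):
    """Penalize only when the generated frame has fewer confident detections.

    Both arguments are lists of detector confidence scores in [0, 1].
    The threshold is an arbitrary, assumed cutoff for "recognizable".
    """
    truth_count = sum(s >= threshold for s in truth_scores)
    generated_count = sum(s >= threshold for s in generated_scores)
    return max(0, truth_count - generated_count)

def soft_detection_penalty(truth_scores, generated_scores):
    """Same idea using summed confidences instead of hard counts.

    Hard counts give no gradient, so a trainable version would probably
    need something smooth like this (an assumption, not a tested recipe).
    """
    return max(0.0, sum(truth_scores) - sum(generated_scores))
```

For example, `detection_count_penalty([0.9, 0.8, 0.7], [0.9, 0.2, 0.1])` counts three confident objects in the ground truth and one in the generated frame, so the penalty is 2. Extra objects in the generated frame are not penalized.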

The other was to try to make some kind of physics-based prediction of the world: if it knows roughly where a street is, or where a wall is relative to an object, then it’ll likely be less confused when generating objects. That idea is very nascent, but right now I’m attacking it by trying to get our RNN to predict an n-body simulation (two or three 2D circles that have a gravitational effect on each other, with bouncing when they collide). The RNN is surprisingly okay at that, even though it’s only examining pixels, but its output gets blurry. I was going to try to get it to spit out actual predictions of position, velocity, acceleration, and radius, in the hopes that it’ll be able to make a connection like “I know there’s a ball flying along this trajectory, so obviously it should still be there 3 frames from now.”
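For anyone who wants to play with the same setup, here is a minimal sketch of that kind of simulation: 2D circles with mutual gravity and elastic bouncing. The constant `G`, the time step, the dict layout, and the distance clamp are all assumptions for illustration, not the simulator actually used.

```python
import math

G = 1.0  # gravitational constant in arbitrary units (assumed)

def step(bodies, dt=0.01):
    """Advance the bodies by one time step (semi-implicit Euler).

    Each body is a dict with keys x, y, vx, vy, m (mass), r (radius).
    """
    # 1. Gravitational acceleration on each body from every other body.
    acc = []
    for i, a in enumerate(bodies):
        ax = ay = 0.0
        for j, b in enumerate(bodies):
            if i == j:
                continue
            dx, dy = b["x"] - a["x"], b["y"] - a["y"]
            # Clamp the distance so overlapping circles do not blow up the force.
            dist = max(math.hypot(dx, dy), a["r"] + b["r"])
            ax += G * b["m"] * dx / dist**3
            ay += G * b["m"] * dy / dist**3
        acc.append((ax, ay))

    # 2. Update velocities first, then positions.
    for body, (ax, ay) in zip(bodies, acc):
        body["vx"] += ax * dt
        body["vy"] += ay * dt
        body["x"] += body["vx"] * dt
        body["y"] += body["vy"] * dt

    # 3. Bounce: elastic collision for every overlapping, approaching pair.
    for i in range(len(bodies)):
        for j in range(i + 1, len(bodies)):
            a, b = bodies[i], bodies[j]
            dx, dy = b["x"] - a["x"], b["y"] - a["y"]
            dist = math.hypot(dx, dy)
            if dist == 0 or dist > a["r"] + b["r"]:
                continue
            nx, ny = dx / dist, dy / dist  # unit normal from a to b
            # Relative velocity along the normal; negative means approaching.
            rel = (b["vx"] - a["vx"]) * nx + (b["vy"] - a["vy"]) * ny
            if rel >= 0:
                continue  # already separating
            impulse = 2 * rel / (1 / a["m"] + 1 / b["m"])
            a["vx"] += impulse * nx / a["m"]
            a["vy"] += impulse * ny / a["m"]
            b["vx"] -= impulse * nx / b["m"]
            b["vy"] -= impulse * ny / b["m"]
    return bodies
```

The per-body state here (position, velocity, radius) is the kind of quantity the RNN could be asked to predict alongside the pixels; rendering the circles to frames is left out.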

It seems like the more traditional solution is to add a loss term related to the optical flow of the image (displacement from the previous frame to the current one), or to do foreground/background segmentation masks and have it focus only on the foreground. Both of those feel like partial solutions though, and there should be some general way to “force it to make up its mind,” as you say. So if you have any oddball ideas (or professional solutions), I’d love to hear!
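The foreground-only variant is simple enough to sketch. This assumes frames are flattened lists of pixel values and that a binary foreground mask is already available from some segmentation step; `masked_mse` is a hypothetical helper name.

```python
def masked_mse(pred, target, mask):
    """Mean squared error over foreground pixels only.

    pred, target: flattened frames (lists of floats).
    mask: 1 for foreground pixels, 0 for background (assumed given).
    """
    weighted = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    count = sum(mask)
    # With an empty mask there is nothing to score, so return zero.
    return weighted / count if count else 0.0
```

Background errors contribute nothing, which is also why it is only a partial fix: the network is still free to blur whatever falls inside the mask.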

Have you checked the RSSM approach in PlaNet and DreamerV1, V2, and V3? It uses a deterministic state (the GRU hidden state) and a stochastic latent state (discrete in DreamerV2 and V3). The deterministic and stochastic (sampled) latent states are both used to predict the next state. I think the stochastic state might help with your problem a bit.
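A toy sketch of that two-part state, to show the shape of the idea rather than the real thing. It is not the Dreamer or PlaNet code: a plain tanh recurrence stands in for the GRU, the weights are random instead of learned, there is no encoder, action input, or training loss, and every name below is made up.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, inputs):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]

def toy_step(h, z_onehot, w_h, w_z, rng):
    """One transition: deterministic update, then sample a discrete latent.

    h        : deterministic state (list of floats)
    z_onehot : previous stochastic state as a one-hot list
    w_h, w_z : hypothetical weight matrices (random here, learned in practice)
    """
    # Deterministic path. A tanh recurrence stands in for the GRU cell.
    h_next = [math.tanh(v) for v in linear(w_h, h + z_onehot)]
    # Stochastic path: a categorical distribution conditioned on h_next.
    probs = softmax(linear(w_z, h_next))
    k = rng.choices(range(len(probs)), weights=probs)[0]
    z_next = [1.0 if i == k else 0.0 for i in range(len(probs))]
    return h_next, z_next, probs

rng = random.Random(0)
H, K = 4, 3  # deterministic size and number of latent classes (arbitrary)
w_h = [[rng.uniform(-1, 1) for _ in range(H + K)] for _ in range(H)]
w_z = [[rng.uniform(-1, 1) for _ in range(H)] for _ in range(K)]

h, z = [0.0] * H, [1.0, 0.0, 0.0]
for _ in range(3):
    h, z, probs = toy_step(h, z, w_h, w_z, rng)
```

The relevant bit is that `z` is a sampled one-hot vector rather than the average over `probs`, so each rollout commits to one concrete latent outcome. That is presumably the property that might help with blurring.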
Dear mystery HN’er, thank you so much. I hadn’t heard about RSSM, and your explanation was wonderfully helpful.

Much appreciated. Have a great weekend :)