
Exactly! That's the very thing I've been trying to find a way to do.

Got any ideas? There are discriminators, but after reading over prior work, it seems like they help without really being a groundbreaking or especially effective solution.

I had two harebrained ideas in mind. One is to add YOLO-style object detection. The difference between a blurry mess and a recognizable object is that a detector can actually recognize the object, so minimizing the error with respect to YOLO's detections might work. (“If there are more recognizable objects in the ground truth image than the generated image, penalize the network.”)
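To make that rule concrete, here's a rough sketch of the penalty. It assumes some detector has already produced one confidence score per detected object for each image; the function name, the score lists, and the 0.5 threshold are all made up for illustration and aren't part of any YOLO API. Counting detections isn't differentiable, so a real training loss would probably need a soft version (for example, comparing summed confidences).

```python
def detection_penalty(truth_scores, generated_scores, threshold=0.5):
    """Hinge-style penalty on the detection-count gap (hypothetical helper).

    truth_scores / generated_scores: per-object confidence scores that some
    detector reported for the ground-truth and generated frames.
    Returns 0.0 unless the generated frame has fewer confident detections.
    """
    # Count detections whose confidence clears the assumed threshold.
    n_truth = sum(1 for s in truth_scores if s >= threshold)
    n_generated = sum(1 for s in generated_scores if s >= threshold)
    # Only penalize missing objects, never extra ones.
    return float(max(0, n_truth - n_generated))


# Ground truth has two confident objects, the generated frame only one.
penalty = detection_penalty([0.9, 0.8, 0.3], [0.6, 0.2])
```

The hinge means a generated frame with extra recognizable objects isn't punished, which matches the quoted rule but may or may not be what you'd want in practice.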

The other was to try to make some kind of physics-based prediction of the world: if it knows roughly where a street is, or where a wall is relative to an object, then it’ll likely be less confused when generating objects. That idea is very nascent, but right now I’m attacking it by trying to get our RNN to predict an n-body simulation. (Two or three 2D circles that have a gravitational effect on each other, with bouncing when they collide.) The RNN is surprisingly okay at that, even though it’s only examining pixels, but its output gets blurry. I was going to try to get it to spit out actual predictions of position, velocity, acceleration, and radius, in the hopes that it’ll be able to make a connection like “I know there’s a ball flying along this trajectory, so obviously it should still be there 3 frames from now.”
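For reference, a toy version of that kind of simulation might look like the sketch below. This is not our actual setup, just a minimal illustration under assumptions: the dict layout, the `g` constant, the semi-implicit Euler integrator, and the clamped-distance trick for overlapping circles are all choices made up for the example. `state_vector` shows one way the ground-truth numbers could be flattened into a regression target.

```python
import math


def step(bodies, dt, g=1.0):
    """Advance 2D circles by one time step (semi-implicit Euler).

    Each body is a dict with keys x, y, vx, vy, m (mass), r (radius).
    g is an assumed gravitational constant in arbitrary units.
    """
    n = len(bodies)
    acc = [[0.0, 0.0] for _ in range(n)]
    # Pairwise gravitational acceleration.
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = bodies[j]["x"] - bodies[i]["x"]
            dy = bodies[j]["y"] - bodies[i]["y"]
            dist = math.hypot(dx, dy)
            if dist == 0.0:
                continue
            # Clamp the distance so touching circles don't get huge forces.
            soft = max(dist, bodies[i]["r"] + bodies[j]["r"])
            a = g * bodies[j]["m"] / (soft * soft)
            acc[i][0] += a * dx / dist
            acc[i][1] += a * dy / dist
    # Update velocities first, then positions.
    for b, (ax, ay) in zip(bodies, acc):
        b["vx"] += ax * dt
        b["vy"] += ay * dt
        b["x"] += b["vx"] * dt
        b["y"] += b["vy"] * dt
    # Elastic bounce for overlapping pairs that are still approaching.
    for i in range(n):
        for j in range(i + 1, n):
            p, q = bodies[i], bodies[j]
            dx, dy = q["x"] - p["x"], q["y"] - p["y"]
            dist = math.hypot(dx, dy)
            if dist == 0.0 or dist > p["r"] + q["r"]:
                continue
            nx, ny = dx / dist, dy / dist
            closing = (p["vx"] - q["vx"]) * nx + (p["vy"] - q["vy"]) * ny
            if closing <= 0.0:
                continue
            impulse = 2.0 * closing / (1.0 / p["m"] + 1.0 / q["m"])
            p["vx"] -= impulse / p["m"] * nx
            p["vy"] -= impulse / p["m"] * ny
            q["vx"] += impulse / q["m"] * nx
            q["vy"] += impulse / q["m"] * ny
    return bodies


def state_vector(bodies):
    """Flatten position, velocity, and radius into one target vector."""
    return [v for b in bodies for v in (b["x"], b["y"], b["vx"], b["vy"], b["r"])]
```

Rendering each step to pixels gives the RNN's input frames, while `state_vector` gives the explicit numbers it could be asked to predict alongside them.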

It seems like the more traditional solution is to add a loss term related to the optical flow of the image (the displacement from the previous frame to the current one), or to use foreground/background segmentation masks and have the network focus only on the foreground. Both of those feel like partial solutions though, and it feels like there should be some general way to “force it to make up its mind,” as you say. So if you have any oddball ideas (or professional solutions), I’d love to hear!
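As a tiny illustration of the foreground-mask variant, here's a sketch of a reconstruction error that only counts foreground pixels. It assumes frames and masks are plain nested lists of numbers with matching shapes and that the mask comes from some separate segmentation step; `masked_mse` is a made-up name, not a library function.

```python
def masked_mse(predicted, target, mask):
    """Mean squared error over pixels where mask is truthy (foreground)."""
    total = 0.0
    count = 0
    for p_row, t_row, m_row in zip(predicted, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += (p - t) ** 2
                count += 1
    # With an empty foreground there is nothing to penalize.
    return total / count if count else 0.0


predicted = [[0.0, 1.0], [1.0, 1.0]]
target = [[0.0, 0.0], [1.0, 0.0]]
foreground_only = masked_mse(predicted, target, [[0, 1], [0, 0]])
whole_frame = masked_mse(predicted, target, [[1, 1], [1, 1]])
```

An optical-flow term would have the same general shape, except the compared quantities would be per-pixel displacement vectors rather than intensities.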