by Sacco215
106 days ago
That's really cool work! I've done some work in this area, and here are my two cents:

1. Convolution-based architectures are terrible: I've trained convolution-based architectures and they almost never scaled. Lately I've switched to transformer-based AEs and they are so much better! We even managed to get Chinchilla-style scaling laws out of a transformer AE.

2. VAEs are terrible for downstream tasks: We've tried training video diffusion models on top of an MAE and a VAE (same architecture), and the MAE is hands down better.

3. This whole field is not science. There is no rigorous way of defining what a "good latent" really is. End-to-end methods (such as PixNerd) are the future, since they eliminate the need to hand-design and optimize the interface between separate components. That being said, I've never seen a neural-field-based video model, and my own limited experiments with one gave underwhelming results.