|
|
|
|
|
by GaggiX
1095 days ago
|
|
This is not how it works, there is no sliding window, the model has no restrictions on the W and H dimensions, only on the C dim, so the UNet can actually be used directly on images with different aspect ratio and resolution; the attention layers pay attention to the entire image. |
|
From the Paella paper[2]: "Our proposal builds on the two-stage paradigm introduced by Esser et al. and consists of a Vector-quantized Generative Adversarial Network (VQGAN) for projecting the high dimensional images into a lower-dimensional latent space... [w]e use a pretrained VQGAN with an f=4 compression and a base resolution of 256×256×3, mapping the image to a latent resolution of 64×64indices." After training, in describing their token predictor architecture: "Our architecture consists of a U-Net-style encoder-decoder structure based on residual blocks,employing convolutional[sic] and attention in both, the encoder and decoder pathways."
U-Net, of course, is a convolutional neural network architecture. [3]. The "down" and "up" encoder/decoder blocks in the Paella code are batch-normed CNN layers. [4]
[1] https://arxiv.org/pdf/2012.09841.pdf [2] https://arxiv.org/pdf/2211.07292.pdf [3] https://arxiv.org/abs/1505.04597 [4] https://github.com/dome272/Paella/blob/main/src/modules.py#L...