|
|
|
|
|
by dplavery92
1092 days ago
|
|
Presumably a transformer model or similar that uses positional encodings for the tokens could do that, but the U-Net decoder here uses a fixed-shape output and learns relationships between tokens (and sizes of image features) based on the positions of those tokens in a fixed-size vector. You could still apply this process convolutionally and slide the entire network around to generate an image that is an arbitrary multiple of the token size, but image content in one area of the image will only be "aware" of image content at a fixed-size neighborhood (e.g. 256x256). |
|