| Mmm… how is a model with a fixed input size, say 512x512 (i.e. a 64x64 latent or whatever), able to output coherent images at a larger size, say 1024x1024? I don’t mean in a “kind of like this” way: PyTorch vector pipelines can’t take arbitrarily sized inputs at runtime, right? If your input has shape [x, y, z], you cannot pass [2x, 2y, 2z] into it. Not “it works, but not very well”; it cannot execute the pipeline at all if the input dimensions aren’t exactly what they were during training. Right? Isn’t that how it works? So, is the image chunked into fixed-size patches and fed through in parts? Or something else? For example, this toy implementation [1] resizes the input image to match the expected input and always emits an output of a specific fixed size. That is what you would expect, but it also suggests that tools like Stable Diffusion work in a way that is distinctly different from what the trivial explanations tend to say. [1] - https://github.com/uygarkurt/UNet-PyTorch/blob/main/inferenc... |
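One way to probe the premise is to separate layers whose weights depend on the input size from layers whose weights do not. The sketch below is a pure-Python toy, not code from Stable Diffusion or from the linked repo: `conv2d_same`, `dense`, and `shape` are made-up helper names, and the 8x8 and 16x16 grids are small stand-ins for the latent sizes in the question. It only illustrates that a convolution's weights are the size of its kernel, so the same weights can slide over a larger grid, while a fully connected layer is tied to one input length. Whether a given model is built purely from size-agnostic layers is something to check per model.

```python
def conv2d_same(image, kernel):
    """Slide a small kernel over a 2-D grid with zero padding.

    The output has the same height and width as the input, whatever they are.
    """
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - ph, x + j - pw
                    # Positions outside the grid count as zero (zero padding).
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += image[yy][xx] * kernel[i][j]
            out[y][x] = acc
    return out


def dense(flat_input, weights):
    """Fully connected layer: one weight row per output, one column per input."""
    if len(flat_input) != len(weights[0]):
        raise ValueError(
            f"expected {len(weights[0])} inputs, got {len(flat_input)}"
        )
    return [sum(w * v for w, v in zip(row, flat_input)) for row in weights]


def shape(grid):
    """Return (height, width) of a 2-D list."""
    return (len(grid), len(grid[0]))


# Nine weights in total, independent of any image size.
kernel = [[0.0, 1.0, 0.0],
          [1.0, -4.0, 1.0],
          [0.0, 1.0, 0.0]]

small = [[1.0] * 8 for _ in range(8)]    # stand-in for the "training" size
large = [[1.0] * 16 for _ in range(16)]  # doubled in each spatial dimension

small_out = conv2d_same(small, kernel)
large_out = conv2d_same(large, kernel)
print(shape(small_out), shape(large_out))

# A dense layer sized for the flattened 8x8 grid (64 inputs, 2 outputs).
dense_weights = [[0.5] * 64 for _ in range(2)]
flat_small = [v for row in small for v in row]
flat_large = [v for row in large for v in row]

print(dense(flat_small, dense_weights))
try:
    dense(flat_large, dense_weights)
except ValueError as err:
    print("dense layer rejected the larger input:", err)
```

The convolution runs on both grids with the same nine weights, while the dense layer only accepts the length it was built for. A pipeline that resizes its input, like the toy implementation in [1], may be doing so for convenience or because some layer in it is size-dependent; the resize by itself does not show which.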