|
|
|
|
|
by singhrac
1068 days ago
|
|
No need for a Hilbert curve, you can just flatten pixels the usual way (ie X = img.reshape(-1)). The main issue is that attention doesn’t scale that well, and with a 512x512 img the attended region is now 262k tokens, which is a lot. The other issue is that you’d throw away data linearizing colors (why not keep them 3-dimensional?). The corresponding work you’re looking for is Vision Transformers (ViT) - they work well, but not as great as LLMs, I think, for generation. Also I think people like that diffusion models are comparatively small and expensive - they’d rather wait than OOM. |
|