


	by singhrac 1068 days ago
	No need for a Hilbert curve; you can just flatten pixels the usual way (i.e. X = img.reshape(-1)). The main issue is that attention scales quadratically with sequence length, and with a 512x512 img the sequence is now 262k tokens (512 x 512 = 262,144), which is a lot. The other issue is that you’d throw away structure by linearizing colors (why not keep each pixel as a 3-dimensional vector?). The corresponding work you’re looking for is Vision Transformers (ViT) - they work well, but I think not as well as LLMs for generation. Also I think people like that diffusion models are comparatively small in memory, even if they are expensive in compute - they’d rather wait than OOM.
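	To put rough numbers on this, here is a minimal NumPy sketch. The random image, the variable names, the fp16 assumption and the 16x16 patch size are all illustrative choices, not anything from a specific model's code. It shows the plain flatten, the per-pixel alternative that keeps colors together, the size of a single full attention matrix at that length, and a ViT-style patch split that shortens the sequence.

```python
import numpy as np

# Hypothetical 512x512 RGB image; random values stand in for real pixels.
img = np.random.rand(512, 512, 3)

# Plain row-major flatten, as in X = img.reshape(-1): the color channels
# are interleaved into one long 1-D sequence of scalars.
flat_scalars = img.reshape(-1)            # shape (786432,)

# Alternative: one token per pixel, keeping the 3 color values together.
pixel_tokens = img.reshape(-1, 3)         # shape (262144, 3)

# Size of one full (n x n) attention matrix for per-pixel tokens.
n_tokens = pixel_tokens.shape[0]          # 512 * 512 = 262144
attn_entries = n_tokens ** 2              # 2**36 entries
attn_gib_fp16 = attn_entries * 2 / 2**30  # assuming 2 bytes per entry: 128 GiB

# ViT-style alternative (assumed patch size 16): split into non-overlapping
# 16x16 patches and flatten each patch into a single token vector.
p = 16
grid = 512 // p                           # 32 patches per side
patches = (
    img.reshape(grid, p, grid, p, 3)      # (row block, row, col block, col, channel)
       .transpose(0, 2, 1, 3, 4)          # group the two block indices first
       .reshape(grid * grid, p * p * 3)   # shape (1024, 768)
)
```

	With patches the sequence drops from 262,144 tokens to 1,024, so the attention matrix shrinks by a factor of 256 squared, which is why patching is the usual workaround for the scaling problem.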
