| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by thethimble 351 days ago
	Vision transformers effectively encode a grid of pixel patches. It’s ultimately a matter of ensuring the position encoding incorporates both X and Y and position. For LLMs we only have one axis of position and - more importantly - the vast majority of training data only is oriented in this way.