	by looobay 246 days ago
	You should read page 6 of the paper (and page 5 for the architecture breakdown). They show that they compress the vision tokens with convolution, which keeps a strong semantic understanding while keeping the number of tokens small. But I think it's still experimental.
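
	To make the idea concrete, here is a rough sketch of what "compressing vision tokens with convolution" can look like. This is not the paper's actual architecture: the grid size, the stride of 2, the random weights, and the function name `compress_tokens` are all assumptions for illustration. It applies a non-overlapping 2x2 strided convolution over a 16x16 token grid, so each output token summarizes four neighbouring input tokens.

```python
import numpy as np

def compress_tokens(tokens, grid_h, grid_w, weight, stride=2):
    """Hypothetical helper: merge each stride x stride block of tokens into one.

    tokens: (grid_h * grid_w, dim) array, row-major over the image grid.
    weight: (stride * stride * dim, out_dim) projection, i.e. the kernel of a
            convolution whose kernel size equals its stride (no padding).
    """
    n, dim = tokens.shape
    assert n == grid_h * grid_w
    assert grid_h % stride == 0 and grid_w % stride == 0
    # Split the grid into non-overlapping stride x stride blocks.
    grid = tokens.reshape(grid_h // stride, stride, grid_w // stride, stride, dim)
    # Bring the two block axes together, then flatten each block into one vector.
    patches = grid.transpose(0, 2, 1, 3, 4).reshape(-1, stride * stride * dim)
    # One matrix multiply per block is exactly a strided convolution.
    return patches @ weight

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16 * 16, 8))        # 256 vision tokens, dim 8 (assumed sizes)
weight = rng.standard_normal((2 * 2 * 8, 8)) * 0.1  # random weights, stand-in for learned ones
compressed = compress_tokens(tokens, 16, 16, weight)
print(tokens.shape, "->", compressed.shape)       # (256, 8) -> (64, 8)
```

	With a stride of 2 the token count drops by 4x; stacking two such layers would give 16x. Whether the semantics survive depends on the learned weights, which this sketch does not model.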