
Hacker News


	by goldwelder42 1176 days ago
	With CNNs, the layers definitely climb levels of abstraction: pixels -> lines -> curves -> abstract shapes. If CNNs can do that, I assume transformers do something similar with language. But it's hard to prove, because with CNNs you can just render the output of each layer as an image. With a language model you have to map the intermediate embeddings back onto discrete tokens, and that isn't straightforward. I wonder whether multimodal LLMs will be able to ground these embedding points in reality, since they connect language and images.
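	To make the "discretize the embeddings into tokens" problem concrete, here is a minimal sketch of one naive approach: take an intermediate vector and rank vocabulary tokens by cosine similarity to their embeddings. The toy vocabulary, the 3-d vectors, and the function names are all hypothetical, and nearest-neighbour lookup is a swapped-in illustration, not how any particular model or paper does it (related techniques such as the so-called "logit lens" instead project hidden states through the model's output/unembedding matrix).

```python
import math

# Hypothetical toy vocabulary with made-up 3-d embeddings.
# Real models use thousands of dimensions and learned weights.
VOCAB = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.3, 0.1],
    "car": [0.1, 0.9, 0.2],
    "idea": [0.0, 0.2, 0.95],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_tokens(hidden, k=2):
    """Return the k vocabulary tokens most similar to a hidden vector.

    This forces a continuous vector onto discrete tokens; whatever part of
    the vector doesn't line up with a single token is simply discarded.
    """
    scored = sorted(VOCAB.items(), key=lambda kv: cosine(hidden, kv[1]), reverse=True)
    return [(tok, round(cosine(hidden, vec), 3)) for tok, vec in scored[:k]]


# A hypothetical intermediate-layer vector sitting between "cat" and "dog".
hidden = [0.85, 0.2, 0.05]
print(nearest_tokens(hidden))
```

	Unlike a CNN feature map, which you can look at directly as pixels, the best you get here is a ranked list of nearby words, and two candidates with nearly identical scores already show how lossy the mapping is.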