| HN Mirror


	by nl 158 days ago
	Isn't this just an awkward way of adding an extra layer to the NN, except without end-to-end training? Models like Stable Diffusion do something similar with CLIP embeddings. It works, and it's an easy way to benefit from CLIP's pre-training. But for a language model, it seems like it would make more sense to just add the extra layer.

1 comment

I mean, that's exactly what it is: just a wrapper that replaces the tokenizer. That's also how LLMs can read images.

I'm just focusing on different parts.
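
For anyone who wants to see the "wrapper that replaces the tokenizer" idea concretely, here is a rough sketch in plain Python. Everything in it is an assumption for illustration: the sizes (`D_MODEL`, `D_IMAGE`), the random weights, and the helper names (`embed_tokens`, `project`, `build_input`) are hypothetical, and the "image features" are random stand-ins for whatever a frozen pre-trained encoder would output. The only point is that a small projection layer turns encoder features into vectors of the same width as token embeddings, so the model can consume them as if they were tokens.

```python
import random

random.seed(0)

VOCAB_SIZE = 10  # hypothetical toy vocabulary
D_MODEL = 4      # hypothetical width of the LLM's token embeddings
D_IMAGE = 6      # hypothetical width of the frozen encoder's features

# Token embedding table: what tokenizer + embedding lookup normally feed the LLM.
token_table = [[random.gauss(0, 1) for _ in range(D_MODEL)]
               for _ in range(VOCAB_SIZE)]

# The "extra layer": a linear map from encoder features to the LLM's embedding space.
# In practice this would be trained; here it is just random weights.
projection = [[random.gauss(0, 0.1) for _ in range(D_MODEL)]
              for _ in range(D_IMAGE)]


def embed_tokens(token_ids):
    """Ordinary path: look up one embedding vector per token id."""
    return [token_table[t] for t in token_ids]


def project(features):
    """Wrapper path: map each D_IMAGE feature vector to a D_MODEL vector."""
    return [
        [sum(f[i] * projection[i][j] for i in range(D_IMAGE))
         for j in range(D_MODEL)]
        for f in features
    ]


def build_input(image_features, token_ids):
    """Prepend projected image vectors to the text token embeddings."""
    return project(image_features) + embed_tokens(token_ids)


# Stand-in for the output of a frozen encoder on 3 image patches.
image_features = [[random.gauss(0, 1) for _ in range(D_IMAGE)] for _ in range(3)]

sequence = build_input(image_features, [1, 5, 2])
print(len(sequence), len(sequence[0]))  # 6 4
```

The model downstream never sees the difference between the first three positions and the last three: they are all just `D_MODEL`-wide vectors. Whether the projection is trained on its own or end-to-end with the rest of the network is the part the parent comment is asking about.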