

by jcattle 6 days ago

Very nice visualizations, thanks for that!

One thing I still struggle with in my head is how these vision embeddings can then be used to give LLMs eyes.

Because you somehow need a giant training set which describes images in natural language, no? Is that actually how it works, or is there some smart trick so you don't need to pay labellers a bunch of money to look at pictures and describe them?

2 comments

dilyevsky 6 days ago

> Because you somehow need a giant training set which describes images in natural language, no?

That's definitely one way - they train a text encoder together with an image encoder on a labelled set of images. WL & 3b1b made a nice video on it: https://www.youtube.com/watch?v=iv-5mZ_9CPY
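To make "train a text encoder together with an image encoder" a bit more concrete, here is a rough sketch of a CLIP-style symmetric contrastive loss in plain Python. It assumes the two encoders already exist and have produced one embedding per image and one per caption. The function names, the toy 2-D vectors, and the temperature of 0.07 are illustrative assumptions, not anything taken from the video or from a particular codebase. The idea is that each image should score highest against its own caption and lower against every other caption in the batch, and vice versa.

```python
import math

def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct "class" for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    # image_embs[i] and text_embs[i] are assumed to describe the same pair.
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity of image i and caption j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch (hypothetical embeddings): captions aligned with their images,
# then the same captions paired with the wrong images.
images = [[1.0, 0.1], [0.1, 1.0]]
matched = [[0.9, 0.2], [0.2, 0.9]]
swapped = [matched[1], matched[0]]

loss_matched = contrastive_loss(images, matched)
loss_swapped = contrastive_loss(images, swapped)
print(loss_matched < loss_swapped)
```

In real training the gradient of this loss flows back into both encoders, which is what pulls matching images and captions into the same region of the embedding space.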


jcattle 6 days ago

Thanks, I'll check out that video.


krackers 5 days ago

>which describes images in natural language,

See CLIP https://github.com/openai/CLIP
