| HN Mirror



	by FeepingCreature 39 days ago
	That's actually how vision language models already work, pretty much.


wongarsu 39 days ago

And there's a reason nobody uses them for face recognition.

Vision language models are an incredible achievement in generality and usability, but they pay a hefty price in fidelity and speed.


stingraycharles 39 days ago

Huh? The images are tokenized in the same way language is, and everything is just fed into one single model, not multiple smaller expert models.

The image gets split into smaller pieces (e.g. 4x4 pixels) and each of those is assigned a token, similar to how text is broken up into tokens. Then the whole thing is fed into a single model.
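To make the patch idea concrete, here is a minimal sketch in plain Python. It assumes a tiny grayscale image stored as nested lists and the 4x4 patch size from the example above; the `patchify` helper is a hypothetical name for illustration, not any library's API. Real models usually go one step further and map each tile to an embedding vector through a learned projection, which is left out here.

```python
from typing import List

Pixel = int  # grayscale intensity, to keep the sketch small
Image = List[List[Pixel]]


def patchify(image: Image, patch: int) -> List[List[Pixel]]:
    """Split an H x W image into non-overlapping patch x patch tiles.

    Tiles come back in row-major order, each flattened to a list of
    patch * patch pixel values.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    if height % patch or width % patch:
        raise ValueError("image dimensions must be multiples of the patch size")

    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one tile row by row.
            tile = [
                image[y][x]
                for y in range(top, top + patch)
                for x in range(left, left + patch)
            ]
            tiles.append(tile)
    return tiles


# An 8x8 image with 4x4 patches yields a sequence of 4 "image tokens".
image = [[y * 8 + x for x in range(8)] for y in range(8)]
tokens = patchify(image, patch=4)
print(len(tokens), len(tokens[0]))  # 4 16
```

The resulting list plays the same role as a token sequence for text: one entry per patch, in a fixed reading order, all going into the same model.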


FeepingCreature 39 days ago

Yes, I'm saying

> Imagine face recognition to work like a text chat, where the PC gets the frame from the camera and writes in the chat: "Who's that? Here's the RGB888 image in hex: ...".

that's pretty much how it works.
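For reference, a rough sketch of what the quoted "RGB888 image in hex" chat message could look like, taken literally. RGB888 means one byte per channel, so each pixel becomes six hex digits. The function names and the exact message wording are assumptions for illustration only; no real model is being called.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]  # one (red, green, blue) pixel, each channel 0..255


def frame_to_hex(pixels: List[RGB]) -> str:
    """Encode pixels as RGB888: one byte per channel, two hex digits per byte."""
    # bytes() raises ValueError if any channel is outside 0..255.
    return bytes(channel for pixel in pixels for channel in pixel).hex()


def build_chat_message(pixels: List[RGB]) -> str:
    """Build the (hypothetical) chat message from the quoted scenario."""
    return "Who's that? Here's the RGB888 image in hex: " + frame_to_hex(pixels)


# Two pixels: pure red, then pure green.
frame = [(255, 0, 0), (0, 255, 0)]
message = build_chat_message(frame)
print(message)
```

At six characters per pixel, a 640x480 frame would already come to 1,843,200 hex characters before any tokenization.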


stingraycharles 39 days ago

But that isn’t a specialized model like the grandparent claimed; it's a single, multi-modal model.


Dylan16807 39 days ago

Yes, the "imagine" was describing the opposite of a specialized model in order to call it a bad idea.
