I get excited for every new vision model, especially those that work better and more efficiently. Vision is where we are so very far behind... I can’t wrap my head around it.
What do you mean far behind? Far behind what? The new Qwen (actually the old one too) can give you bounding rectangular prisms around things in a scene, correctly OCR text with ink spilled on it, read graphs, and understand spatial relationships. I think it's pretty impressive for something I'm running on like a 5-year-old GPU.
yeah i know lol, that’s kind of my point. impressive that it runs on your gpu, but it still can’t tell you what happens if you tilt a glass. that’s what world models are working toward. but even then... so what? you get a perfect simulator. it knows the glass tips. it still doesn’t know why someone tipped it, or what happens if they don’t. A four-year-old can do this and we’re just barely on step one and a half.