| As a distributed systems engineer, we are a LONG way from "magical scalable ai". The bottleneck for a developing AI is experience. Yes we need compute, but we need data to compute on. We have bypassed that limit by starting with literally every scrap of human generated prose that ever existed. I expect an explosion of expansion when visual and world models hit critical mass to properly leverage new experiences. But even then, engaging with reality is the bottleneck. I can build you a very efficient scalable online map-reduce-like that runs inference on new corpus. We already made that. It took hardware getting large enough to fit the corpus in memory, instead of "scaling" it with networks for it to be viable. The latency of the network passing around partial solutions was WAY too high. Computers don't scale forever. They are made of hot metals. The limits are heat, material, and the speed of light, but those are very real limits, that don't offer more than a constant multiplier of advantage over meat. AIs might get smarter than us, arguably, like many other meat and paper based super-human intelligences around us, they already are. But it doesn't scale forever. It will hit limits, fairly quickly, of compute and experience to integrate into its overfit model. |
And so far, the "visual data for improving general intelligence" runs have been nothing but disappointments.
I think vision is just a piss-poor modality to learn intelligence from? Very low value, both per bit and per token. You only ever want to tap it if you need your AI to operate on visual data at deployment time. Otherwise, even "experience" is best gathered in text RLVR rollouts.
The secret of human sample efficiency isn't that visual data is somehow better for learning intelligence. It just isn't. Human "training data" is a hundred kinds of awful - humans are just good at scavenging it for all it's worth. Evolution has tuned that very well.
Which means: AIs can get good at it too. It's not a wall - it's a skill issue.