I feel LeCun got roped into debating the likes of Marcus and Yudkowsky, and this has made his arguments lose nuance and become rigid. I also can't escape the feeling that if Facebook had been tuned in to Transformers, they would have shipped earlier. So there must have been some resistance or underestimation that is now repeated: "They can't reason", "They can't plan", "They can't understand the world", "They are a distraction / side road to AGI". It is kind of ironic that researchers who claim LLMs lack adaptive intelligence seemingly refuse to adapt their own intelligence to LLMs. If even GPT-3 can find logical holes or oversimplifications in your arguments about GPTs, at some point this starts to become embarrassing and unbecoming.

> The generation of mostly realistic-looking videos from prompts does not indicate that a system understands the physical world.

While arguably true, it also does not indicate that a system does not understand the physical world (reflections, collision detection, gravity, object permanence, long-term scene coherence, etc.). If LeCun wants to argue that it does not understand the physical world, he should do so directly, rather than attack something that was never directly stated, only tentatively (if convincingly) demo'd. I myself find it hard to argue that a system that generates novel pond reflections has not memorized, or stored in its weights, some generalization program that it applies to realistic scene generation. This demo shows it is not even a wild prediction that we will soon be able to discuss visual scenes with conversational AIs in consumer tech.
FWIW, none of the video models released so far demonstrate any object coherence whatsoever, which suggests they don't yet have the higher-level capabilities you mention.
In Sora, as soon as an object is obstructed by an obstacle or goes offscreen, it's likely to disappear or be radically transformed.