|
> Current batch of deep learning models are fundamentally a technology for labor automation. This is immensely useful in itself, without the need to do AGI. The Sora2 capabilities are absolutely wild (see a great example here of what non-professional users are already able to create with it: https://www.youtube.com/watch?v=HXp8_w3XzgU )

> So only looking at video capabilities, or at coding capabilities, it's already ready to automate and upend industries worth trillions in the long run.

Can Sora2 change the framing of a picture without changing the global scene? Can it change the temperature of a specific light source? Can it generate 8K HDR footage suitable for re-framing and color grading? Can it generate a minute-long video without losing coherence? Actually, can it even generate a few seconds without having to loop back from the last frame and produce the obnoxious cuts that the video you linked has?
Can it reshoot the exact same scene with just one element altered?

All the video models right now are only good at making short, low-res, barely post-processable video. The kind of stuff you see on social media. And considering the metrics on AI-generated video on social media right now, for the most part nobody wants to look at it. These models might replace the bottom of the barrel of social media posting (hello, cute puppy videos), but there is absolutely nothing indicating that they might automate or upend any real industry (be used in the pipeline, yeah, maybe, why not; automate? I won't hold my breath). And as for the argument about their future capabilities, well... For 50+ years now we've been told we'd have fusion in 20 years.

Btw, the same argument can be made for LLMs and image-gen tech for any creative purpose. People severely underestimate just how much editing, re-work, purpose, and pre-production go into any major creative endeavor. Most models are just severely ill-suited for that work. They can be useful for some things (specifically, for editing images, AI-driven image fill does work decently, for example), but overall, as of right now, they are mostly good at making low-quality content. Which is fine, I guess, there is a market for it, but it was already a market that was not keen on spending money. |
Qwen Image and Nano Banana can both do that with images; there’s zero reason to think we can’t train video models for masking.
This feels a lot like critiquing Stable Diffusion over hands and text, which the new SOTA models all handle well.
One of the easiest iterations on these models is to add more training cases to the benchmarks. That’s a timeline of months, not comparable to forecasting progress over 20 years like fusion.