by jiggawatts 857 days ago
Adding to this: Sora was most likely trained on video that's more like what you'd normally see on YouTube or in a stock footage or media licensing company's collection. Basically, video designed to look good as part of a film or similar production.

So right now, Sora is predicting "Hollywood style" content, with cuts, camera motions, etc., all much like what you'd expect to see in an edited film.

Nothing stops someone (including OpenAI) from training the same architecture with "real world captures".

Imagine telling a bunch of warehouse workers that for "safety" they all need to wear a GoPro-like action camera on their helmets that records everything inside the work area. Run that in a bunch of warehouses with varying sizes, contents, and forklifts, and then pump all of that through this architecture to train it. Include the instructions given to the staff from the ERP system as well as the transcribed audio as the text prompt.

Ta-da.

You have yourself an AI that can control a robot using the same action camera as its vision input. It will be able to follow instructions from the ERP, listen to spoken instructions, and even respond with a natural voice. It'll even be able to handle scenarios such as spills, breaks, or other accidents... just like the humans in its training data did. This is basically what vehicle auto-pilots do, but on steroids.

Sure, the computing power required for this is outrageously expensive right now, but give it ten to twenty years and... no more manual labour.