Video understanding is still fairly new, especially when done well, and it would be great if it works well for UI and UX. Current agents already struggle a bit with 2D space in normal screenshots of unconventional UIs. I wonder if this model would do better with actual recordings of someone navigating and using applications; it feels like it could help a lot with understanding UX, at least. Will be fun to play around with :)
Sure, but again, it's a micro 3B model. Perhaps it can't be used for general video work, but it might be able to do basic edits, like removing an object from a table in a shot.