We are talking about our underlying tech at the Next conference: https://goo.gl/3ihXth
We are clearly using frame-level annotations, but we also have additional models that aggregate visual and other information to provide entities at the shot level or video level. PM at Google
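For illustration only, here is a rough sketch of what rolling frame-level labels up to the shot level could look like. This is not Google's model: the function name `aggregate_to_shots`, the `(time, label, confidence)` tuple layout, and the plain averaging with a `min_confidence` cutoff are all assumptions made up for the example.

```python
from collections import defaultdict


def aggregate_to_shots(frame_labels, shots, min_confidence=0.5):
    """Roll frame-level labels up into one label set per shot.

    frame_labels: list of (time_sec, label, confidence) tuples (hypothetical layout).
    shots: list of (start_sec, end_sec) intervals, end exclusive.
    Returns one dict per shot mapping label -> mean confidence.
    """
    shot_entities = []
    for start, end in shots:
        # Collect every confidence seen for each label inside this shot.
        per_label = defaultdict(list)
        for time_sec, label, confidence in frame_labels:
            if start <= time_sec < end:
                per_label[label].append(confidence)

        # Keep only labels whose mean confidence clears the cutoff.
        entities = {}
        for label, confidences in per_label.items():
            mean = sum(confidences) / len(confidences)
            if mean >= min_confidence:
                entities[label] = mean
        shot_entities.append(entities)
    return shot_entities


frames = [
    (0.0, "tiger", 1.0),
    (1.0, "tiger", 0.5),
    (1.0, "grass", 0.25),
    (5.0, "boat", 0.75),
]
shots = [(0.0, 4.0), (4.0, 8.0)]
print(aggregate_to_shots(frames, shots))
```

A real system would presumably use a learned model here rather than an average, which is the point being made above.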
It would be completely naive to implement it that way, considering that video has an entirely new attribute that images lack, which of course is "time".
I don't know shit about ML (talking out of my ass here), but I'd be surprised if the algorithms didn't account for changes over time or canonical entity recognition (is this the same boat that was in the last image?).
The linked press release shows that an animal is detected (a tiger, etc.). It does not say the tiger is running or hunting, which is where the time component would have been used.
> nouns such as “dog,” “flower” or “human” or verbs such as “run,” “swim” or “fly”
That out of the way, I suspect you wouldn't need video to detect those things.
And the screenshot you're referring to is a specific application of the API, not a kitchen sink:
> It can even provide contextual understanding of when those entities appear; for example, searching for “Tiger” would find all precise shots containing tigers across a video collection in Google Cloud Storage.
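To make the quoted search use case concrete, here is a minimal sketch of looking up shots by label once shot-level labels exist. It does not call the real API; the `index` layout, the file names, and the `find_shots` helper are hypothetical and only meant to show the idea.

```python
def find_shots(index, query):
    """Return (video, start_sec, end_sec) for every shot labeled with `query`.

    index: dict mapping a video name to a list of (start_sec, end_sec, labels)
           tuples, where labels is a set of strings (hypothetical layout).
    Matching is case-insensitive.
    """
    wanted = query.lower()
    hits = []
    for video, shots in index.items():
        for start, end, labels in shots:
            if wanted in {label.lower() for label in labels}:
                hits.append((video, start, end))
    return hits


# Made-up shot-level labels for two videos.
index = {
    "safari.mp4": [(0.0, 4.0, {"tiger", "grass"}), (4.0, 9.0, {"jeep"})],
    "zoo.mp4": [(0.0, 6.0, {"Tiger"}), (6.0, 8.0, {"flower"})],
}
print(find_shots(index, "Tiger"))
```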