I used something like this a few years ago in a project fairly similar to this one. There's a bunch of parsing and processing to do with that, and the "0.3" value is ... fiddly, but it worked pretty well:
For this project, I want an AI solution for finding the most 'interesting' frames. I'm not even sure how to measure interestingness yet; it might be the presence of text, the presence of a human ...
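A minimal sketch of how signals like those could be combined into a ranking, assuming some detector has already produced per-frame flags. Everything here is hypothetical: the `FrameSignals` fields, the `WEIGHTS` values, and the helper names are placeholders for illustration, not a real library's API, and the actual text or human detection is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class FrameSignals:
    """Per-frame flags, assumed to come from separate detectors (not shown)."""
    index: int
    has_text: bool
    has_human: bool


# Hypothetical weights: arbitrary starting values that would need tuning.
WEIGHTS = {"has_text": 1.0, "has_human": 2.0}


def interestingness(frame: FrameSignals) -> float:
    """Sum the weights of every signal that is present in the frame."""
    return sum(w for name, w in WEIGHTS.items() if getattr(frame, name))


def top_frames(frames: list[FrameSignals], k: int) -> list[int]:
    """Return the indices of the k highest-scoring frames.

    Ties are broken by frame order, so earlier frames win.
    """
    ranked = sorted(frames, key=lambda f: (-interestingness(f), f.index))
    return [f.index for f in ranked[:k]]


frames = [
    FrameSignals(0, has_text=False, has_human=False),
    FrameSignals(1, has_text=True, has_human=False),
    FrameSignals(2, has_text=True, has_human=True),
    FrameSignals(3, has_text=False, has_human=True),
]
print(top_frames(frames, k=2))
```

A weighted sum keeps each signal independent, so adding another measure of interestingness later only means one more field and one more weight.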