by varelaz 1707 days ago
I used a similar approach for video hashing. Instead of a fixed interval I used key frames extracted with ffmpeg, so you don't depend on the codec. I also didn't rescale, but hashed every frame. For YouTube I found that it still produces different hashes sometimes.

edit: to get only keyframes use select=eq(pict_type,I)
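
A rough sketch of how the filter in the edit above could feed a hashing step, in Python. Everything except the select expression is an assumption: the names keyframe_command, hash_frames and keyframe_hashes are made up, MD5 is only a placeholder since the comment does not say which hash was used, and the caller has to supply the video's width and height because nothing is rescaled.

```python
import hashlib
import subprocess


def keyframe_command(path):
    """Build an ffmpeg command that writes only I-frames as raw grayscale bytes."""
    return [
        "ffmpeg", "-v", "error", "-i", path, "-an",
        # Filter from the comment; quoted so the comma is not read as a filter separator.
        "-vf", "select='eq(pict_type,I)'",
        # Keep only the selected frames instead of duplicating them to fill gaps
        # (newer ffmpeg builds spell this option -fps_mode vfr).
        "-vsync", "vfr",
        "-f", "rawvideo", "-pix_fmt", "gray", "-",
    ]


def hash_frames(raw, width, height):
    """Split a raw gray stream into whole frames and hash each one."""
    size = width * height  # one byte per pixel for pix_fmt gray
    return [
        hashlib.md5(raw[i:i + size]).hexdigest()  # placeholder hash (assumption)
        for i in range(0, len(raw) - size + 1, size)
    ]


def keyframe_hashes(path, width, height):
    """Run ffmpeg on a file and return one hash per key frame."""
    result = subprocess.run(keyframe_command(path), check=True, capture_output=True)
    return hash_frames(result.stdout, width, height)
```

Calling keyframe_hashes("input.mp4", 1280, 720) would need ffmpeg on the PATH and the real frame size. An exact hash like MD5 changes on any pixel difference, so a perceptual hash may be a better fit in practice.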

2 comments

I get that decoding only keyframes will be much faster, but how can codec independence be maintained when different codecs will insert keyframes at very different points?

Could such an algorithm ever find a duplicate between, say, a GIF (every frame is a keyframe) and a video in any modern codec with very few keyframes?

(or is this optimization specifically for videos known to be encoded with the exact same codec, and specifically with a static keyframe interval?)

Codecs are algorithms for generating B and P frames. I frames are essentially just JPEGs. Yes, codecs can split a video differently, but given the same split, the same frame will be encoded the same way. In most cases the key frame frequency is just a number. Some formats, like HLS, cannot work with a variable key frame frequency at all. This matters because different versions of the same codec can decode B and P frames differently (no guarantee), but not I frames. So I frames are the most stable.
faster: -skip_frame nokey
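
For reference, a minimal sketch of where that flag would go, in Python. The function name fast_keyframe_command and the raw grayscale output are assumptions for illustration only. -skip_frame is a decoder option, so it has to appear before -i to apply to the input. It keeps the frames the decoder flags as key frames, which may not be exactly the same set as pict_type I for every stream.

```python
def fast_keyframe_command(path):
    """Build an ffmpeg command that makes the decoder skip non-key frames."""
    return [
        "ffmpeg", "-v", "error",
        "-skip_frame", "nokey",  # decoder option: placed before -i so it applies to the input
        "-i", path, "-an",
        # Keep only the decoded frames instead of duplicating them to fill gaps
        # (newer ffmpeg builds spell this option -fps_mode vfr).
        "-vsync", "vfr",
        "-f", "rawvideo", "-pix_fmt", "gray", "-",
    ]
```

The speedup comes from the decoder never reconstructing B and P frames, rather than decoding everything and discarding most of it in a filter.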