| As other posters have commented, hashing schemes like this are not robust to even a few very simple transformations, such as slight offsets in time, and cannot detect whether one video is a clip from another. Practically, you can only expect such a tool to match up "reencodings" from one format to another. Because of this, IMO it's worth including the video length as a separate field within the hash. That way, when you are doing a search you can sort the videos by length and then only calculate the Hamming distance for videos of similar length (a short video will never be a duplicate of a long one, even if the perceptual hashes are close). Unfortunately, when all videos are the same length this doesn't change the O(n^2) time complexity, but assuming they are not, this optimization should give a significant search time benefit and false positive reduction for large datasets. I haven't actually checked what the author is doing, but for the sake of not having to decode entire videos (very CPU intensive) it's worth limiting the hash generation process to a small portion from the start of the video (somewhere between 30 seconds and 5 minutes has worked for me, depending on the content). As well as saving time, this helps you detect reencodings where the time base is slightly altered (the frames will diverge more and more as time goes on). (Just adding some of my own experience from my own similar video hashing project!) edit: inb4 someone complains about O(n^2) and mentions BK trees: for numbers up to ~500k hashes in my own test set, a BK tree has always been slower than doing the naive search over a neatly aligned stripe of sorted-by-video-length hashes in memory. (Maybe I need to learn how to do memory arenas for cache locality.) (Or maybe I need to make my own video hashes not be 500 bits long.) |
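
To make the length-sorted search concrete, here is a minimal Python sketch of the idea, not code from my project or from the author's tool. The record layout `(name, length_seconds, hash_int)`, the function names, and the `length_tolerance` and `max_distance` defaults are all assumptions picked for illustration; real thresholds would depend on the hash size and the content.

```python
from typing import List, Tuple

# Hypothetical record layout: (name, length in seconds, perceptual hash as an int).
Video = Tuple[str, float, int]


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes stored as Python ints."""
    return bin(a ^ b).count("1")


def find_duplicates(
    videos: List[Video],
    length_tolerance: float = 2.0,  # assumed: max length difference in seconds
    max_distance: int = 10,         # assumed: max Hamming distance for a match
) -> List[Tuple[str, str, int]]:
    """Compare only videos whose lengths are within length_tolerance."""
    # Sort once by length so similar-length videos sit next to each other.
    ordered = sorted(videos, key=lambda v: v[1])
    matches = []
    for i, (name_a, len_a, hash_a) in enumerate(ordered):
        j = i + 1
        # Walk forward only while the next video is close enough in length.
        while j < len(ordered) and ordered[j][1] - len_a <= length_tolerance:
            name_b, _, hash_b = ordered[j]
            distance = hamming(hash_a, hash_b)
            if distance <= max_distance:
                matches.append((name_a, name_b, distance))
            j += 1
    return matches


if __name__ == "__main__":
    sample = [
        ("a.mp4", 60.0, 0b1111),
        ("b.mkv", 60.5, 0b1110),   # similar length, 1 bit different
        ("c.mp4", 600.0, 0b1111),  # identical hash, but far too long to compare
    ]
    print(find_duplicates(sample))
```

The worst case is still quadratic when every video has the same length, since the inner loop then scans the whole list. With varied lengths, each video is only compared against its neighbours in the sorted order, which is where both the speedup and the drop in false positives come from.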