by throwaway_bad 2442 days ago
People do deduplicate files to save space, but it's usually based on an exact byte match using MD5 or SHA-256. Some services don't, due to privacy issues: https://news.ycombinator.com/item?id=2438181 (e.g., the MPAA could upload all their torrented movies and see which ones upload instantly, proving that your system already holds their copyrighted files)
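
To make the exact-match case concrete, here is a minimal sketch of a content-addressed store keyed by SHA-256. `DedupStore` and `upload` are made-up names for illustration, not any real service's API, and everything lives in memory.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: one blob per distinct SHA-256 digest."""

    def __init__(self):
        self.blobs = {}  # digest -> bytes (stored once per distinct content)
        self.names = {}  # filename -> digest

    def upload(self, name: str, data: bytes) -> bool:
        """Store data under name; return True if these exact bytes were already held."""
        digest = hashlib.sha256(data).hexdigest()
        already_present = digest in self.blobs
        if not already_present:
            self.blobs[digest] = data
        self.names[name] = digest
        return already_present


store = DedupStore()
first = store.upload("a.mp4", b"same bytes")    # new content
second = store.upload("b.mp4", b"same bytes")   # identical bytes, deduplicated
third = store.upload("c.mp4", b"same bytes!")   # one byte differs, stored separately
```

The boolean that `upload` returns is the privacy leak in miniature: an "instant" upload tells the uploader that someone else already stored those exact bytes.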

There's no way to make the UX work out for images that are only similar. Would be pretty wild to upload a picture of myself just to see a picture of my twin used instead.

But I do wonder if it's possible to deduplicate different resolutions of an image that differ only in the upscaling/downscaling algorithm and compression level used (thereby solving the JPEG erosion problem: https://xkcd.com/1683/).

1 comment

The CNN methods in the package are particularly robust against resolution differences. In fact, if a simple up/downscale is all that differentiates two images, then even the hashing algorithms can be expected to do a good job.
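
As a rough illustration of why hashing can survive rescaling, here is a sketch of an average hash (one simple kind of perceptual hash) in plain Python. This is not the package's implementation; the function names are made up, images are assumed to be 2D lists of grayscale values, and each side is assumed to be at least 8 pixels.

```python
def box_downscale(pixels, size=8):
    """Shrink a 2D grayscale grid to size x size by averaging blocks."""
    h, w = len(pixels), len(pixels[0])
    out = []
    for by in range(size):
        y0, y1 = by * h // size, (by + 1) * h // size
        row = []
        for bx in range(size):
            x0, x1 = bx * w // size, (bx + 1) * w // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def average_hash(pixels, size=8):
    """Return a size*size-bit integer: one bit per cell, set if the cell is above the mean."""
    small = box_downscale(pixels, size)
    flat = [v for row in small for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | int(v > mean)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


# A 16x16 gradient, the same image upscaled 2x (nearest neighbour), and its inverse.
small_img = [[x * 8 + y * 4 for x in range(16)] for y in range(16)]
big_img = [[small_img[y // 2][x // 2] for x in range(32)] for y in range(32)]
other_img = [[255 - v for v in row] for row in small_img]

h_small = average_hash(small_img)
h_big = average_hash(big_img)
h_other = average_hash(other_img)
```

Comparing `hamming(h_small, h_big)` with `hamming(h_small, h_other)` shows the idea: the upscaled copy lands at distance 0 while the inverted image is far away. With real resampling filters and JPEG compression the distance would be small rather than exactly zero, so in practice you'd pick a threshold.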