Hacker News
by glenstein 445 days ago
>1. MS Office doesn't work on Linux, and for many people it's a hard requirement that they specifically need it.

Office 365 is web-based and Linux accesses the web. If you're deep enough in the weeds that your survival hinges on desktop-only features that can only be accessed from Windows, then you're in a use case that can be resolved by getting a laptop for that purpose or being furnished one by your work.

And while your use case is fascinatingly specific, my understanding is that the paradigm there is what might be called visual hashing or perceptual hashing, which goes beyond mere file size comparison and hashes a more generalized notion of image similarity.

You may already know this, but from checking with ChatGPT, there's something called DupeGuru which appears to be cross-platform. It also looks like there are some powerful Python and Perl libraries. Again, I'm sure your use case has some specific wrinkles to it, and you may very well know all of that already, so those might not help. But I suppose the interesting thing here is that the more idiosyncratic a use case is, the more closely it approximates things solved by programming languages, which puts you back in the paradigm where Linux is not merely usable but, I would argue, the friendliest option.

1 comments

Could you point me to some more information about "visual hashing"? I'm a bit tired of "try this tool because it worked for my set of 100 pictures", but if I could read an explanation of why a given tool/library does what I want, that would be fantastic. The biggest issue of my use case is the sheer number of files.

Right, you're looking for things that work at the scale of 100k separate files or so. Moreover, you seem pretty used to getting bad recommendations, and I know the feeling. An important caveat is that all I know about these is what I've ChatGPT'd about them.

There's the aforementioned DupeGuru program, which is cross-platform and wields a handful of algorithms. Then there's aHash (average hash), dHash (difference hash), and pHash (perceptual hash). They each make assumptions about which subset of image data is important, pull it out, compare it, and are meant to do it quickly and at large scales. They are all accessible from within Python's imagehash library and require getting your hands dirty with Python. My understanding is that DupeGuru uses its own custom perceptual hashing methods.
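
To make the "pull out a subset of image data and compare it" idea concrete, here is a minimal sketch of the dHash idea in plain Python. It is not the imagehash library's code, and the names `dhash_bits` and `hamming_distance` are made up for illustration. It assumes the image has already been shrunk to a tiny grayscale grid (typically 9x8) by some image library, and that step is not shown.

```python
def dhash_bits(gray_rows):
    """Difference hash of an already downscaled grayscale grid.

    gray_rows is a list of rows, each row a list of brightness values.
    Each pair of horizontal neighbours contributes one bit: 1 if the
    left pixel is brighter than the right one, else 0. A grid that is
    9 wide and 8 tall therefore yields a 64-bit integer.
    """
    value = 0
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            value = (value << 1) | (1 if left > right else 0)
    return value


def hamming_distance(hash_a, hash_b):
    """Count the bit positions where two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")
```

Because only the direction of each brightness change is kept, a uniformly brighter copy of a picture gets the same bits, and near-duplicates end up a small Hamming distance apart. What distance counts as "the same picture" is a threshold you would have to tune on your own files.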

And although it seems like you need something more specific, the very very lazy choice is MD5 checksum comparison, which is super fast but only tests whether files are identical copies.
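
For what it's worth, the lazy exact-copy check can be sketched with nothing but the Python standard library. The function names and the chunk size are my own choices for illustration, not something taken from any particular tool.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_identical_files(root):
    """Group files under root by digest; return only groups with 2+ members."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[md5_of_file(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

This is one pass over the files and a dictionary lookup per file, so it should cope with 100k files, but a resized or re-encoded copy of a picture gets a completely different digest and will not be grouped.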

dHash sounds like a good starting point if I ever get to the situation where VisiPics doesn't work anymore for some reason. It's horribly difficult to replace software that is "just good enough", and all of its problems have known mitigations.