Hacker News new | ask | show | jobs
by viraptor 2599 days ago
You could say that hash is a very aggressive, lossy compression.

Same image will result in same hash. Different images will often result in different hashes. That makes it a valid hash.

1 comments

> You could say that hash is a very aggressive, lossy compression.

You could but it would be about as misleading as a sentence could possibly be.

The purpose of compression is to preserve content as much as possible: similar inputs should give similar outputs, and the output should provide as much information about the input as it can.

Hashing deliberately does the exact opposite - slightly different inputs should give wildly different outputs - as its primary purpose in the case of crypto hashes, and in the case of index hashes as a performance optimization (which is the primary purpose of index hashes).

If you want specifically a cryptographic/indexing hash, where 1 bit of change in input changes ~50% of bits in the output, then that's one possible goal/restriction. But that's just one kind of hash function.

But you can have hash functions with the goal of preserving similarity. For example soundex is a hash function with that constraint. From: https://pdfs.semanticscholar.org/06d6/8587c27058dd6ab3fb8238...

> For example value 1 = "Damieva" and value 2 = "Dameiva." These two values will produce the same Soundex hash value, creating a match.

There's also the whole class of LSH https://en.wikipedia.org/wiki/Locality-sensitive_hashing

From Wikipedia: "A hash function is any function that can be used to map data of arbitrary size onto data of a fixed size."

As for hash functions having the desired property of different inputs leading to vastly different outputs, that only describes one subset. There are many hash algorithms that aim to achieve precisely the opposite, such as simhash.