Ask HN: How do orgs detect conflicting hashes in sub-second timing?

Y	Hacker News new \| ask \| show \| jobs

3 points by adam_ellsworth 1457 days ago

Today I was considering how organizations such as bitly, imgur, reddit, and even github handle hash collision prevention.

While for GH it (seems) easier as commit hashes are sufficiently long/unique, for others with character hashes that only span ~6 chars or so, there's bound to be instances where hashes conflict from a statistical standpoint. (iirc reading an article on a hash collision in GH a few years ago here on ycomb)

To my mind these orgs have to have a suite tools/algos requesting information from multiple services, checking whether or not a hash has been taken – and those processes have to optimize for time. (e.g. when a user makes a post, what's a reasonable time to do a lookup?)

So, what are the considerations which need to be made algorithmically to check such collisions while keeping runtime to an acceptable minimum?

5 comments

verdverm 1457 days ago

Database indexes most likely. Consider GitHub...

1. Hashes are determined outside of their context, by git.

2. They store this is a database

3. Conflicts are so rare that they can be ignored. They must happen within the same repository. If they are in different repos, then it is not a conflict.

link

frogger8 1457 days ago

Not sure how these orgs do it but the system design book by Alex Xu uses timestamp + dataCenterID + machineId + sequence number that is reset every millisecond. This integer is then converted to base62.

link

manx 1457 days ago

If you have different workers to generate ids, give all of them their own ranges in advance. Once a worker used up their range, they can request a new range for their next N (e.g. 100k) ids.

link

compressedgas 1457 days ago

Do you mean "how do they do unique identifier generation"?

Most use a combination of precoordination and a database that when an insert fails they just generate new and try again.

link

ev1 1457 days ago

what parts specifically of those sites? most are not hashes.

link