Hacker News new | ask | show | jobs
by xvector 2349 days ago
Can you also hash every key to effectively turn them into a number, and then perform radix sort?

If you hash with SHA-256 then any sorting operation seems like it’d just take O(256N)=O(N).

What am I missing? Clearly this cannot work because otherwise everyone would do this as we accept O(NlogN) as the norm!

3 comments

> What am I missing? Clearly this cannot work because otherwise everyone would do this as we accept O(NlogN) as the norm!

Let's assume a perfect hash that never collides or leaves gaps, the best case for this.

For a radix sort in some base with K digits, N can only go so high before you have duplicates. Specifically, base^K >= N

Take the log of both sides. K >= logN

So with arbitrarily large numbers the time taken to sort is actually NlogN.

With fixed-size numbers it's "O(K*N)" but K is larger than logN would be.

I see. Then Radix Sort never really was faster than traditional sorts if we go by Big-O notation alone, right?

If I understand your post correctly, Radix sorts in O(log(maxsize) * N) instead of O(NlogN). So why do people say Radix Sort is O(N)?

People say that because it's true, for a bounded key size, and in practice many radix sort implementations have a large fixed key size, so the log factor just doesn't come up.

E.g., if you use a 64-bit key, with a default implementation, you support up to 2^64 elements with your O(n) sort. No one has a machine with more than 2^64 elements today, so the the sort is in practical terms very much like O(n).

OTOH the O(n log n) of comparison based sorts are very real: the log term is in n, the number of inputs, and it is very obvious when you plot sort performance.

What is interesting though is when you start optimizing radix sort in certain ways, e.g., eliminating passes based on the high bits being zero (in an LSD sort) or using an MSD sort which stops when the bucket size gets smaller than some threshold, the log n factor actually appears again: since you often need ~log n passes to complete the sort rather than say a fixed number like log 64 (base = digit bits) = 8 for 64-bit keys. The base of the logarithm is larger though (2^radix bits), so the hidden constant factor is lower than comparison sort where the base is generally 2.

One answer is that if you allow duplicates then it can handle larger amounts of items without any slowdown.

Another answer is that people are silly and constant vs. log factors aren't intuitive.

Calculating SHA256 is neither O(1) nor cheap.

Your idea is interesting and might be useful in some applications! But calculating a SHA256 takes time linear in the length of the thing you're hashing. Even if you assume fixed-size keys, SHA256 isn't fast. It's not designed to be fast.

You'd likely need an extremely high N before O(N) SHA256's is faster than O(N log N) comparisons.

> But calculating a SHA256 takes time linear in the length of the thing you're hashing.

Ahem, so does hashing.

And so does sort comparison.

The GP's idea is actually quite effective for medium-to-large objects.

As a bonus, you can cache and store the SHA256 of each object for future in-set testing and uniqueness-finding without looking at the objects themselves. (You cannot do this with non-cryptographic hashes or sorting by comparisons.)

This is basically what Git does.

Well SHA256 is a red herring here: one could suggest any suitable fast hash, non-crytopgraphic hash instead.

Yes, it takes time linear in the length of the thing you are sorting, but comparison sort is generally worse in that respect: comparison also takes up to linear time in the comprands, and you have to compare each element not just once but log n times [1].

So the time to hash all the elements once (if the hash is not already cached in the element) is going to be a small part of the sort time.

[1] There are ways around repeatedly re-comparing the entire key, but they work mostly in specialized cases.

You shouldn't accept O(n log n) as the norm: radix sort often gives you O(n) in practice (yes it is really something like O(n log k) for key size k, but k is often fixed, e.g., at 64 bits).

I think radix sorts are probably less popular since, the standard for sorting has always been to provide a "comparator" that compares two objects, and radix sort wants something different entirely: an key you can extract bits from instead. So radix sort was never a drop-in replacement e.g., for qsort in C.

Often you can provide such a key object (and indeed in trivial cases like sorting integers, the entire object is simply the key), but sometimes it would need to be longer than 32 or 64 bits, so the best interface isn't clear: especially in oldschool languages like C which want to pass function pointers as the comprand.

This was kind of cemented when e.g., C++ standard library made == and < the standard things you need to implement not just for std::sort, but also std::map and std::set and many algorithms that rely on order. It's similar in a way to how std::unordered_set/map wasn't a thing until much later when the hashing concept was introduced.

When you can use it, though, radix sort is great: sometimes an order of magnitude faster than comparison sort.