Hacker News new | ask | show | jobs
by bane 4010 days ago
Part of the problem when estimating hash complexity is that what's usually considered is something that's basically memory + offset. Basically a few mov's and an add.

However, this is absolutely dominated by complexity of the hashing function, growth (which can be amortized, but is not O(1)) deletion (also not O(1)) and comparison functions (which are usually O(mn) or O(n) (or some similar depending)).

We end up measuring the things that are the most minimal in hashing and pretending like the expensive operations, which aren't O(1), don't exist. This is wrong and dishonest when considering algorithms. We know for example that string comparison is not O(1), and it's usually a part of most hash table algorithms, and yet it magically disappears when analyzing hash table complexity and everything is somehow supposed to be considered as a mem+offset which is stupid.

Hash-tables also usually have exponential memory growth complexity which nobody ever pays attention to. Of course there are versions which don't do this and/or have fixed memory sizes, but the random kind you find in most languages grow O(n^2) or similar. And this is also ignored and we pretend it doesn't happen and resizing magically becomes part of the "1" in O(1)....even though copying arrays isn't free and as resizes happen arrays get bigger and big-O should get worse.

Hash-table operations can be pretty expensive and faulty big-O analysis doesn't help. So sure, for a static, fixed size, hash table, with no growth, instantaneous hash functions that use temporal oracles, don't need to deal with collisions, O(1) is correct. But these hash functions pretty much don't exit in nature, and most programmers won't be working with them. The standard libraries for most languages sure as hell don't implement such ideal hash tables.

O(n) is at least an honest approximation. I honestly have never seen a well considered analysis of hash function complexity but I know it grows super-linearly from just using them a lot. This kind of fanciful analysis doesn't help anybody.

So yes, O(something) where something has a unit. But the unit sure as heck isn't whatever "1" is supposed to represent. It's probably closer to string length or array length (or some combination of the two), but a single "add" it is most definitely not.

1 comments

All of what you're saying is true, about the speed of real algorithms running on real hardware with real problem sizes.

But if you're talking about big-O, you are explicitly not talking about that. You're talking about how the speed of the algorithm hypothetically scales as some parameter tends to infinity.

To wit, O(n) doesn't mean "this algorithm takes kn time to run for a given n", it means "this algorithm's runtime for all n > c for is bounded above by nk for some c and k".

Sound like a analytic club that's rarely accurate to real-world performance? Yup, that's big-O :)

Right, so I'm asserting that the parameter usually called for in big-O analysis of hash table operations is the wrong one since it measures the lookup complexity and not the hashing complexity, which is usually O(n). And this results in meaningless complexity analysis which gives you things like "Insert is O(1) on Average but O(n) Worst Case or Θ(n)".

This analysis is using the hash table length as the parameter under consideration, but that's silly, because most of the complexity of hash-tables is in the hashing which (depending on the hash function) usually has a known complexity of O(n). Where n is the length of the input string.

This is far more important in hash table complexity analysis than table size because this part of the operation dominates runtime.

You can also do multi-paramter big-O analysis. O(mn) is a thing for example, with two differently sized parameters, both of which contribute to the complexity of the algorithm and can't be dismissed.

So charitably, if you need to provide a big-O for say, hash table inserts, it's reasonable to say O(mn) where m is the length of the table and n is the length of the input string, but it's not necessary since table length usually has little contribution to the complexity of the algorithm...hence why people keep saying inserts are O(1) on average, because table length isn't much of a determinant in hash table operation complexity. Just like big-O hides constants, we can hide parameters that are dominated by another. O(1) is doing this backwards.

My guess is that O(1) is some weird cargo-culted remnant from some early work done on Hash tables as a generalization of Arrays, likely from when the hash functions were just modulus operations on integer keys or some other simple, fixed length hashing method that was easy to ignore (like the universal hash in Corman's "Introduction to Algorithms". But modern hash-tables can't make these assumptions and I, and quite a few other folks, think that this needs to be rethought.

Some examples (I don't agree with all of these, but I think it makes the point):

https://stackoverflow.com/questions/2771368/can-hash-tables-...

http://lemire.me/blog/archives/2009/08/18/do-hash-tables-wor...

> Sound like a analytic club that's rarely accurate to real-world performance? Yup, that's big-O

If your complexity analysis isn't measurable by real-world performance, it's likely that you aren't analyzing the correct parameters.

> And this results in meaningless complexity analysis which gives you things like "Insert is O(1) on Average but O(n) Worst Case or Θ(n)".

I dunno, that just sounds like a time complexity to me. Quick, what's the time complexity of quicksort?

> This analysis is using the hash table length as the parameter under consideration, but that's silly

I think you're just talking about something else, then? Sure, the analysis generally assumes a finite key size, and looks at performance as the table size increases. That's just pragmatism; people generally have a bounded key size and an unbounded amount of data.

If your complaint is that treating the key length as finite results in a complexity of O(1), then... that's the point. Treating the key length as finite results in a complexity of O(1).

Table size isn't much of a determinant. It isn't any of a determinant, on average. Only the key length matters. That conclusion is the whole point of this analysis.

> it's reasonable to say O(mn)

I'm confused if this is what you meant to write-- this is not an accurate complexity for a hash table insert because, as you have pointed out, a hash table insert doesn't depend on the table size. There should be no factor of m. Edit: Er, technically O(n) is in O(mn)? Is that your point? But O(mn) doesn't simplify to O(n) unless m is constant, which I don't think you're saying.

With respect to key size n and table size m, the average complexity should be O(n). If we let key size be finite relative to the size of the table, that gives us O(1). But if you don't like that, you can let key size grow to infinity and you're right back to O(n).

None of this is going to tell you if it's the right data structure to use, though.

> If your complexity analysis isn't measurable by real-world performance, it's likely that you aren't analyzing the correct parameters.

No, in this case you are looking at the correct parameters but using the wrong model. At least, I think; it's still not clear what you're trying to get done here where big-O is letting you down.

On a hash with a million elements, the O(n) hashing of a 10-char key is negligible.

Also, you have to make apples to apples comparisons. In this case, you're comparing the time to search against the number of elements, and that's it. If you want the time to hash AND search - as a function of n - then your analysis holds, but you can't then compare that against other datastructures that do not have an equivalent hash step.