| HN Mirror


by bane 4009 days ago

Right, so I'm asserting that the parameter usually used in big-O analysis of hash table operations is the wrong one, since it measures the lookup complexity and not the hashing complexity, which is usually O(n) in the length of the key. And this results in meaningless complexity analysis which gives you things like "Insert is O(1) on Average but O(n) Worst Case or Θ(n)".

This analysis is using the hash table length as the parameter under consideration, but that's silly, because most of the complexity of hash tables is in the hashing, which (depending on the hash function) usually has a known complexity of O(n), where n is the length of the input string.

This is far more important in hash table complexity analysis than table size because this part of the operation dominates runtime.
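To make that concrete, here is a minimal sketch (not any particular library's implementation) that counts the per-byte steps of a 32-bit FNV-1a hash. The helper names `fnv1a_32` and `bucket_index` are made up for illustration; the only point is that the hashing work tracks key length and is unaffected by table size.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # standard 32-bit FNV prime


def fnv1a_32(key: bytes):
    """Return (hash, byte_steps) for a 32-bit FNV-1a hash of key."""
    h = FNV_OFFSET_BASIS
    steps = 0
    for b in key:
        h ^= b
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # keep the value within 32 bits
        steps += 1                        # one unit of work per key byte
    return h, steps


def bucket_index(key: bytes, table_size: int):
    """Hypothetical helper: map a key to a bucket and report hashing steps."""
    h, steps = fnv1a_32(key)
    return h % table_size, steps


for size in (16, 1_000_000):
    _, steps = bucket_index(b"0123456789", size)
    print(size, steps)  # steps is 10 for both table sizes
```

A 1000-byte key would take 1000 steps in the same loop, regardless of how many buckets the table has.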

You can also do multi-parameter big-O analysis. O(mn) is a thing for example, with two differently sized parameters, both of which contribute to the complexity of the algorithm and can't be dismissed.

So charitably, if you need to provide a big-O for say, hash table inserts, it's reasonable to say O(mn) where m is the length of the table and n is the length of the input string, but it's not necessary since table length usually has little contribution to the complexity of the algorithm...hence why people keep saying inserts are O(1) on average, because table length isn't much of a determinant in hash table operation complexity. Just like big-O hides constants, we can hide parameters that are dominated by another. O(1) is doing this backwards.

My guess is that O(1) is some weird cargo-culted remnant from some early work done on hash tables as a generalization of arrays, likely from when the hash functions were just modulus operations on integer keys or some other simple, fixed-length hashing method that was easy to ignore (like the universal hash in Cormen's "Introduction to Algorithms"). But modern hash tables can't make these assumptions and I, and quite a few other folks, think that this needs to be rethought.

Some examples (I don't agree with all of these, but I think it makes the point):

https://stackoverflow.com/questions/2771368/can-hash-tables-...

http://lemire.me/blog/archives/2009/08/18/do-hash-tables-wor...

> Sound like a analytic club that's rarely accurate to real-world performance? Yup, that's big-O

If your complexity analysis isn't measurable by real-world performance, it's likely that you aren't analyzing the correct parameters.

2 comments

Cushman 4009 days ago

> And this results in meaningless complexity analysis which gives you things like "Insert is O(1) on Average but O(n) Worst Case or Θ(n)".

I dunno, that just sounds like a time complexity to me. Quick, what's the time complexity of quicksort?
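For anyone who wants to see the gap between average and worst case in numbers, a rough sketch: a naive first-element-pivot quicksort that counts pivot comparisons. The pivot choice is an assumption picked to make the worst case easy to trigger, not how library sorts are written, and `quicksort_comparisons` is a hypothetical helper.

```python
import random


def quicksort_comparisons(items):
    """Sort a copy with first-element-pivot quicksort.

    Returns (sorted_list, comparisons), counting one pivot comparison
    per remaining element at each partition step.
    """
    count = 0

    def sort(xs):
        nonlocal count
        if len(xs) <= 1:
            return xs
        pivot, rest = xs[0], xs[1:]
        count += len(rest)
        smaller = [x for x in rest if x < pivot]
        larger = [x for x in rest if x >= pivot]
        return sort(smaller) + [pivot] + sort(larger)

    return sort(list(items)), count


n = 200
# Already-sorted input is the worst case here: n*(n-1)/2 = 19900 comparisons.
_, worst = quicksort_comparisons(range(n))

data = list(range(n))
random.Random(0).shuffle(data)  # fixed seed so the run is repeatable
_, typical = quicksort_comparisons(data)

print(worst, typical)
```

Same algorithm, same n, very different counts. That is all "O(n log n) on average, O(n^2) worst case" is saying.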

> This analysis is using the hash table length as the parameter under consideration, but that's silly

I think you're just talking about something else, then? Sure, the analysis generally assumes a finite key size, and looks at performance as the table size increases. That's just pragmatism; people generally have a bounded key size and an unbounded amount of data.

If your complaint is that treating the key length as finite results in a complexity of O(1), then... that's the point. Treating the key length as finite results in a complexity of O(1).

Table size isn't much of a determinant. It isn't any of a determinant, on average. Only the key length matters. That conclusion is the whole point of this analysis.

> it's reasonable to say O(mn)

I'm not sure this is what you meant to write. This is not an accurate complexity for a hash table insert because, as you have pointed out, a hash table insert doesn't depend on the table size. There should be no factor of m. Edit: Er, technically O(n) is in O(mn)? Is that your point? But O(mn) doesn't simplify to O(n) unless m is constant, which I don't think you're saying.

With respect to key size n and table size m, the average complexity should be O(n). If we let key size be bounded by a constant while the table grows, that gives us O(1). But if you don't like that, you can let key size grow without bound and you're right back to O(n).
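Spelled out, with T_avg as a made-up name for the average insert cost, and assuming the hash spreads keys evenly enough that the expected per-bucket work is a constant number of key comparisons:

```latex
T_{\text{avg}}(n, m) = O(n)
\quad\text{and}\quad
n \le c \ \text{for a constant } c
\;\Longrightarrow\;
T_{\text{avg}}(n, m) = O(c) = O(1).
```

The table size m appears as an argument but not on the right-hand side, which is the whole result.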

None of this is going to tell you if it's the right data structure to use, though.

> If your complexity analysis isn't measurable by real-world performance, it's likely that you aren't analyzing the correct parameters.

No, in this case you are looking at the correct parameters but using the wrong model. At least, I think; it's still not clear what you're trying to get done here where big-O is letting you down.

miscfuck 4009 days ago

On a hash with a million elements, the O(n) hashing of a 10-char key is negligible.

Also, you have to make apples-to-apples comparisons. In this case, you're comparing the time to search against the number of elements, and that's it. If you want the time to hash AND search, as a function of n, then your analysis holds, but you can't then compare that against other data structures that do not have an equivalent hash step.