by bane 4010 days ago

I don't know why the complexity estimates I find for hashes are always so bad. They never account for growth (or, depending on the implementation, shrinking on delete), never account for the hash function, etc.

By the time you hash a key, you could have likely already inserted it into a trie.

Lookups on hashes are also not O(1) for similar reasons. You have to hash the search string, then compare it against the value at whatever location it hashed to (usually a string comparison, which isn't O(1)), and, depending on the collision strategy, do more work if the slot is occupied but doesn't match.
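
A minimal sketch of that cost breakdown, counting bytes touched rather than wall-clock time. It assumes 64-bit FNV-1a as a stand-in hash (the comment does not name one), and the names `fnv1a_64` and `lookup_cost` are made up for illustration.

```python
def fnv1a_64(data: bytes) -> tuple:
    """Return (hash, bytes_touched) for a 64-bit FNV-1a hash of data."""
    h = 0xCBF29CE484222325  # FNV-1a 64-bit offset basis
    touched = 0
    for b in data:
        h ^= b
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF  # FNV prime, wrapped to 64 bits
        touched += 1
    return h, touched


def lookup_cost(key: bytes, stored: bytes) -> int:
    """Bytes touched by one lookup: hash the key, then compare it with the stored key."""
    _, hashed = fnv1a_64(key)
    compared = 0
    for x, y in zip(key, stored):
        compared += 1
        if x != y:
            break  # comparison stops at the first mismatch
    return hashed + compared


short_cost = lookup_cost(b"a" * 10, b"a" * 10)
long_cost = lookup_cost(b"a" * 10_000, b"a" * 10_000)
print(short_cost, long_cost)
```

Under these assumptions the work per lookup grows with the key length even though nothing about the table itself has changed.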

chrisseaton 4010 days ago

When you say O(something), the something has a unit.

Hash table lookups and insertions take time O(1), when talking about number of items already in the hash table - and that 'when talking about' is implicit and doesn't have to be said as anyone talking about the complexity of a hash table knows that, or would state otherwise as it would be an exception.

Talking about the length of strings used as keys in the hash table is therefore nonsense when we have already agreed that we're talking about the number of elements in the hash table, as length of the strings isn't a parameter in that function.

And they never account for growth - yeah, it's amortised isn't it? That's what we wanted to do when we do an O().

I mean what you are proposing would be an interesting study - the complexity of hash tables parameterised for all sorts of different properties such as the size increase function or the load factor or whatever, but it's not what anyone means when they talk about complexity of a hash table unless they state otherwise.

So nobody's getting it wrong or making a mistake. They're using some assumptions that everyone's already agreed on and know about. If you want to talk about the complexity of something else you need to clarify.
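
A rough sketch of the amortisation argument, assuming a table that doubles its capacity whenever it is full (real implementations usually resize at some load factor, so the constants differ). `total_copies` is a hypothetical helper that only counts how many entries get moved during resizes.

```python
def total_copies(n_inserts: int, initial_capacity: int = 8) -> int:
    """Count entries copied by resizes while inserting n_inserts items."""
    capacity = initial_capacity
    size = 0
    copies = 0
    for _ in range(n_inserts):
        if size == capacity:
            copies += size  # every existing entry is moved to the new array
            capacity *= 2
        size += 1
    return copies


n = 1_000_000
print(total_copies(n) / n)  # copies per insert, averaged over all inserts
```

With doubling, the total number of copies stays below 2n, so the resize cost averaged over all inserts is bounded by a constant.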

rnovak 4010 days ago

I have two kinds of nits with this logic, but I could totally be wrong, and you should feel absolutely free to correct me.

I'm fairly positive that a unit of measure should _never_ be variable, otherwise it's fairly pointless. And if you don't think hashing a 1TB string takes significantly longer than a 100-byte string ...

Furthermore, big O notation is supposed to be a wide upper bound, but I don't think that works well if it's not actually an upper bound. If you were to tell your bosses something took constant time, and it took an hour for string A (1TB) and 100ms for string B (10B), I'm pretty sure your opinion wouldn't mean much after that.

The hashmap should absolutely be considered in terms of the hash function, because it's the longest-running part of the algorithm. To do otherwise is disingenuous.

Using that logic, I could call any iterative algorithm O(1) since the top level function only gets called once.

chrisseaton 4010 days ago

I think the problem you would have with your boss would be that your boss asked you 'how long will this program run', and if you told them O(1), the question you are really answering is 'what is the time complexity of this algorithm, parameterised by the number of entries, as the number of entries tends towards infinity'.

If your boss really wanted O(), then they wouldn't care that hashing one key takes a day and another a second, because they're thinking in terms of a hash with infinite entries, so the difference between a day and a second to hash is irrelevant.

rnovak 4009 days ago

If you released software that was exponential, but your QA department only ever tested small inputs, I think it would be negligent not to tell your boss and/or clients how fast the run-time could grow.

And I think it would take a horrible manager not to care about the difference between a day and a second.

chrisseaton 4009 days ago

Yeah I agree but my point is none of what you are talking about has anything to do with O(), which doesn't attempt or claim to do what you want to do.

It's like someone gave you a hammer and you're saying it's broken because it doesn't cut wood very well. It's not designed for that.

chinpokomon 4010 days ago

To implement the hash table you wouldn't have to hash the whole string... of course this will depend on the data that you are trying to store. Assuming that the data is random, 100 bytes vs. 100 terabytes, you only need to figure out which bucket the data is saved in. You could still base it on this concept if slightly modified.

rnovak 4009 days ago

Yes. A valid hash function certainly can be defined as a F(n): N x N -> n % 100

But that certainly cannot be considered a reasonable hash function. A string is basically an array of bytes (or code-points, in case of UTF-8).

To have any decent property (like producing different outputs for minuscule changes in the input), you have to touch every element in the array.

For custom objects, yes, you don't have to hash every property, but for strings, yeah, the hash function will almost always depend on the length of the string.
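
To make the trade-off concrete, here is a small sketch comparing a hash that reads only a fixed prefix with one that reads every byte. FNV-1a and the `user-record-` keys are assumptions chosen for illustration, not anything specified in the thread.

```python
def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a over every byte of data."""
    h = 0xCBF29CE484222325
    for b in data:
        h = ((h ^ b) * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h


def prefix_hash(data: bytes, k: int = 8) -> int:
    """Constant-time variant: hash only the first k bytes."""
    return fnv1a_64(data[:k])


# Hypothetical keys that share a long common prefix and differ only at the end.
keys = [f"user-record-{i:05d}".encode() for i in range(1000)]

distinct_prefix = len({prefix_hash(key) for key in keys})
distinct_full = len({fnv1a_64(key) for key in keys})
print(distinct_prefix, distinct_full)
```

The prefix-only hash runs in time independent of key length, but every key here lands on the same hash value, so the saved hashing work would come back as collision handling.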

bane 4010 days ago

Part of the problem when estimating hash complexity is that what's usually considered is something that's basically memory + offset. Basically a few mov's and an add.

However, this is absolutely dominated by the complexity of the hashing function, growth (which can be amortized, but is not O(1)), deletion (also not O(1)), and comparison functions (which are usually O(mn) or O(n), or something similar depending on the implementation).

We end up measuring the things that are the most minimal in hashing and pretending like the expensive operations, which aren't O(1), don't exist. This is wrong and dishonest when considering algorithms. We know for example that string comparison is not O(1), and it's usually a part of most hash table algorithms, and yet it magically disappears when analyzing hash table complexity and everything is somehow supposed to be considered as a mem+offset which is stupid.

Hash-tables also usually have exponential memory growth complexity which nobody ever pays attention to. Of course there are versions which don't do this and/or have fixed memory sizes, but the random kind you find in most languages grow O(n^2) or similar. And this is also ignored and we pretend it doesn't happen and resizing magically becomes part of the "1" in O(1)....even though copying arrays isn't free and as resizes happen arrays get bigger and big-O should get worse.

Hash-table operations can be pretty expensive and faulty big-O analysis doesn't help. So sure, for a static, fixed-size hash table with no growth, instantaneous hash functions that use temporal oracles, and no collisions to deal with, O(1) is correct. But these hash functions pretty much don't exist in nature, and most programmers won't be working with them. The standard libraries for most languages sure as hell don't implement such ideal hash tables.

O(n) is at least an honest approximation. I honestly have never seen a well considered analysis of hash function complexity but I know it grows super-linearly from just using them a lot. This kind of fanciful analysis doesn't help anybody.

So yes, O(something) where something has a unit. But the unit sure as heck isn't whatever "1" is supposed to represent. It's probably closer to string length or array length (or some combination of the two), but a single "add" it is most definitely not.

Cushman 4009 days ago

All of what you're saying is true, about the speed of real algorithms running on real hardware with real problem sizes.

But if you're talking about big-O, you are explicitly not talking about that. You're talking about how the speed of the algorithm hypothetically scales as some parameter tends to infinity.

To wit, O(n) doesn't mean "this algorithm takes kn time to run for a given n", it means "this algorithm's runtime for all n > c is bounded above by nk for some c and k".
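
Written out with the same c and k, the standard definition looks like this (f is the runtime and g the bounding function; both symbols are introduced here only for the statement):

```latex
f(n) \in O(g(n)) \iff \exists\, c > 0,\ \exists\, k > 0 \ \text{such that}\ \forall n > c:\ f(n) \le k \cdot g(n)

% The O(n) case from the sentence above is g(n) = n:
f(n) \in O(n) \iff \exists\, c, k > 0 \ \text{such that}\ \forall n > c:\ f(n) \le k\, n
```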

Sound like an analytic club that's rarely accurate to real-world performance? Yup, that's big-O :)

bane 4009 days ago

Right, so I'm asserting that the parameter usually called for in big-O analysis of hash table operations is the wrong one since it measures the lookup complexity and not the hashing complexity, which is usually O(n). And this results in meaningless complexity analysis which gives you things like "Insert is O(1) on Average but O(n) Worst Case or Θ(n)".

This analysis is using the hash table length as the parameter under consideration, but that's silly, because most of the complexity of hash-tables is in the hashing which (depending on the hash function) usually has a known complexity of O(n). Where n is the length of the input string.

This is far more important in hash table complexity analysis than table size because this part of the operation dominates runtime.

You can also do multi-parameter big-O analysis. O(mn) is a thing for example, with two differently sized parameters, both of which contribute to the complexity of the algorithm and can't be dismissed.

So charitably, if you need to provide a big-O for say, hash table inserts, it's reasonable to say O(mn) where m is the length of the table and n is the length of the input string, but it's not necessary since table length usually has little contribution to the complexity of the algorithm...hence why people keep saying inserts are O(1) on average, because table length isn't much of a determinant in hash table operation complexity. Just like big-O hides constants, we can hide parameters that are dominated by another. O(1) is doing this backwards.

My guess is that O(1) is some weird cargo-culted remnant from some early work done on hash tables as a generalization of arrays, likely from when the hash functions were just modulus operations on integer keys or some other simple, fixed-length hashing method that was easy to ignore (like the universal hash in Cormen's "Introduction to Algorithms"). But modern hash tables can't make these assumptions and I, and quite a few other folks, think that this needs to be rethought.

Some examples (I don't agree with all of these, but I think it makes the point):

https://stackoverflow.com/questions/2771368/can-hash-tables-...

http://lemire.me/blog/archives/2009/08/18/do-hash-tables-wor...

> Sound like an analytic club that's rarely accurate to real-world performance? Yup, that's big-O

If your complexity analysis isn't measurable by real-world performance, it's likely that you aren't analyzing the correct parameters.

Cushman 4009 days ago

> And this results in meaningless complexity analysis which gives you things like "Insert is O(1) on Average but O(n) Worst Case or Θ(n)".

I dunno, that just sounds like a time complexity to me. Quick, what's the time complexity of quicksort?

> This analysis is using the hash table length as the parameter under consideration, but that's silly

I think you're just talking about something else, then? Sure, the analysis generally assumes a finite key size, and looks at performance as the table size increases. That's just pragmatism; people generally have a bounded key size and an unbounded amount of data.

If your complaint is that treating the key length as finite results in a complexity of O(1), then... that's the point. Treating the key length as finite results in a complexity of O(1).

Table size isn't much of a determinant. It isn't any of a determinant, on average. Only the key length matters. That conclusion is the whole point of this analysis.

> it's reasonable to say O(mn)

I'm confused if this is what you meant to write-- this is not an accurate complexity for a hash table insert because, as you have pointed out, a hash table insert doesn't depend on the table size. There should be no factor of m. Edit: Er, technically O(n) is in O(mn)? Is that your point? But O(mn) doesn't simplify to O(n) unless m is constant, which I don't think you're saying.

With respect to key size n and table size m, the average complexity should be O(n). If we let key size be finite relative to the size of the table, that gives us O(1). But if you don't like that, you can let key size grow to infinity and you're right back to O(n).
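
A toy experiment along those lines, offered as a sketch rather than a benchmark. It assumes separate chaining, FNV-1a as the hash, roughly two buckets per entry, and zero-padded decimal keys; `CountingTable` and `avg_lookup_cost` are hypothetical names. Each comparison is charged at full key length, which is an upper bound on the real cost.

```python
FNV_OFFSET = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


class CountingTable:
    """Toy separate-chaining table that counts bytes hashed and compared."""

    def __init__(self, n_buckets: int) -> None:
        self.buckets = [[] for _ in range(n_buckets)]
        self.bytes_touched = 0

    def _bucket(self, key: bytes) -> list:
        h = FNV_OFFSET
        for b in key:
            h = ((h ^ b) * FNV_PRIME) & MASK64
        self.bytes_touched += len(key)  # hashing reads every byte of the key
        return self.buckets[h % len(self.buckets)]

    def insert(self, key: bytes) -> None:
        self._bucket(key).append(key)

    def contains(self, key: bytes) -> bool:
        for stored in self._bucket(key):
            self.bytes_touched += len(key)  # charge a full-length comparison
            if stored == key:
                return True
        return False


def avg_lookup_cost(m: int, key_len: int) -> float:
    """Average bytes touched per successful lookup with m keys of key_len bytes."""
    table = CountingTable(2 * m + 1)
    keys = [f"{i:0{key_len}d}".encode() for i in range(m)]
    for key in keys:
        table.insert(key)
    table.bytes_touched = 0  # measure lookups only
    for key in keys:
        assert table.contains(key)
    return table.bytes_touched / m


small = avg_lookup_cost(1_000, 16)       # small table, short keys
large = avg_lookup_cost(50_000, 16)      # 50x more entries, same key length
long_keys = avg_lookup_cost(1_000, 64)   # same table size, 4x longer keys
print(small, large, long_keys)
```

If the hash spreads keys reasonably, the first two numbers should be close to each other while the third is several times larger, which matches the O(n)-in-key-length, independent-of-m reading above.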

None of this is going to tell you if it's the right data structure to use, though.

> If your complexity analysis isn't measurable by real-world performance, it's likely that you aren't analyzing the correct parameters.

No, in this case you are looking at the correct parameters but using the wrong model. At least, I think; it's still not clear what you're trying to get done here where big-O is letting you down.

miscfuck 4009 days ago

On a hash with a million elements, the O(n) hashing of a 10-char key is negligible.

Also, you have to make apples to apples comparisons. In this case, you're comparing the time to search against the number of elements, and that's it. If you want the time to hash AND search - as a function of n - then your analysis holds, but you can't then compare that against other data structures that do not have an equivalent hash step.
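
For one such comparison, here is a sketch that counts key comparisons in a binary search over a sorted list, a structure with no hash step. The zero-padded keys and the function name are assumptions for illustration; each string comparison can itself touch up to the full key length.

```python
def binary_search_comparisons(sorted_keys: list, target: str) -> int:
    """Count key comparisons made by a lower-bound binary search for target."""
    lo, hi = 0, len(sorted_keys)
    comparisons = 0
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return comparisons


def make_keys(m: int) -> list:
    # Zero-padded so that lexicographic order matches numeric order.
    return [f"{i:010d}" for i in range(m)]


small_keys = make_keys(1_024)
big_keys = make_keys(65_536)

small_cmp = binary_search_comparisons(small_keys, small_keys[-1])
big_cmp = binary_search_comparisons(big_keys, big_keys[-1])
print(small_cmp, big_cmp)
```

Here the number of comparisons grows with the number of elements, which is the quantity the usual O(1) versus O(log n) comparison is about; the per-comparison cost in key length applies to both structures.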
