Hacker News new | ask | show | jobs
by Cushman 4009 days ago
Edit: Ugh, the old version of this post turned out really lectury, and I'm still working on my own intuitive understanding here, so I'm just gonna start over.

Basically-- you're not wrong. It's a quirk of analysis that a hash table is allowed to take the modulo (for example) of an arbitrarily large number, producing an arbitrarily large number, in "constant time", while a trie is not allowed to descend to an arbitrary depth in constant time.

But saying that such a thing is possible is a simplifying assumption that we make, in general, for non-numeric algorithms: arithmetic is O(1).

Of course, real programs cannot do this for arbitrarily large numbers. So if you are actually concerned with arbitrarily large numbers, you must do something else. Thus, other sorts of analysis which would give you a different number for hash table complexity. Yay!

And if you're having difficulty using the reported big-O value under the canonical model to compare algorithms which have very similar amortized performance in the real world, your problem is just that big-O is not for that thing you're doing.

1 comments

Thanks for taking the time to revise and give a thorough and fair reply.

To give background, my main concern is with the exceptionalism: the defining (good) thing about STEM is that there's actual logic to the discipline, where if I forget an answer, I can re-derive it. It's not just a game of memorization and regurgitation.

But O(1) for hash lookup feels like the old soft-sciency "hey, you just have to memorize this". I don't have one consistent model that lets me derive bounds, and which produces O(1) for hash lookup and O(log n) for trie lookup. The other convention of constant seek times is unrealistic [2], but it's at least consistent. This isn't.

Rather, it's supposed to be a big-O table. If these were presented as "in-practice" bounds, with its own consistent logic about which resources are scarce relative to which, that would be a different story. Then I would have a consistent set of conventions that produce O(1); it wouldn't come off as "oh, we assume the rules don't apply here".

>But saying that such a thing is possible is a simplifying assumption that we make, in general, for non-numeric algorithms: arithmetic is O(1).

Where? Every treatment I've seen takes arithmetic as linear in the number of bits. [1] This is still consistent with O(n log n) for sorts because you can assume a bound on element size, and still only have constant factor overhead.

>Of course, real programs cannot do this for arbitrarily large numbers. So if you are actually concerned with arbitrarily large numbers, you must do something else. Thus, other sorts of analysis which would give you a different number for hash table complexity. Yay!

But (as above), the problem is it's not even clear what sort of problem class we were doing that let's you assume this particular thing can be bounded but others can't.

[1] https://en.wikipedia.org/wiki/Computational_complexity_of_ma...

[2] Pretty sure that in the physical universe, memory seek requires n^(1/3) because you have to cram it into a volume whose size increases with the cube of longest distance between addresses, and travel time is linear in distance.

> Where? Every treatment I've seen takes arithmetic as linear in the number of bits. [1] This is still consistent with O(n log n) for sorts because you can assume a bound on element size, and still only have constant factor overhead.

This is probably the only point of disagreement here. If I understand your critique of the O(1) result, it's that addressing n elements takes arithmetic (hashing) on numbers of log n bits.

ISTM that, assuming an integer RAM model, that same critique must apply to any input. A mere pointer into an array of n elements has log n bits, so just incrementing and testing the pointer can't be O(1), and even iterating that array can't be O(n).

It's still hard for me to see any reason why the one or two arithmetic operations in a for loop can be O(1), while the larger but still constant number used by a hash function can't.

>This is probably the only point of disagreement here. If I understand your critique of the O(1) result, it's that addressing n elements takes arithmetic (hashing) on numbers of log n bits.

It's not the addressing I object to -- this whole domain consistently assumes O(1) seek times. It's the computation of the hash itself. A hash function that, by supposition, minimizes collisions is going to have to avalanche the whole thing such that it has to perform a complex calculation -- not mere lookup -- that incorporates multiple combinations of the elements. The growth in cost is not from doing seeks over bigger ranges but from computation. You can no more assume away that growth than you can assume that md5 is just as expensive than SHA256.

>ISTM that, assuming an integer RAM model, that same critique must apply to any input. A mere pointer into an array of n elements has log n bits, so just incrementing and testing the pointer can't be O(1), and even iterating that array can't be O(n).

I never objected to this before but I've always disagreed: iterating needn't be log(n) because any such procedure can be optimized away to get the next element via either a) a seek, or b) incremeting a value, which is constant amortized (because bit rollover is exponentially rare).