Hacker News new | ask | show | jobs
by chillee 2359 days ago
I understand the sentiment, but O(n) vs O(1) and O(n^2) vs O(n log n) are huge jumps in complexity. Even with relatively small sizes like N=100, you're already starting with 2 orders of magnitude disadvantage.

The example in this post is a log N factor. log N is a relatively small growth, you'd need 1000 elements to get 1 order of magnitude and you'll never run into 2 orders of magnitude.

If you can come up with reasonable code where an order N complexity slower algorithm is faster in practice - I'd love to see it.

5 comments

The issue here is mostly about what you are counting.

Almost all instances I've seen of big O scaling not being realistic count every memory access as 1 operation. In reality, memory accesses have a huge variation in how much time they take. Moreover, this does have a dependence on the program.

This is of course due to caches. As such, an O(n^2) algorithm that has very local memory accesses can actually beat an O(n) algorithm that spreads out its memory accesses randomly over the entire memory space.

It has come to the point where I care more about the expected cache hits of an algorithm than the big O scaling.

I was about to say "the example in the post is sorting so it's n * log(n)".

Upon thinking about it, the example is comparing the relative speed of hashing every item (O(n) * O(1)) == O(n) and sorting (O(n * log(n)).

That means the ratio of these two is a log n factor n * log(n) / n == log(n).

good point.

As a side point, I think the main point of the post is that the time taken for the hash solution, increases at a greater rate than the sort + unique solution. Now, that doesn't make sense with our big 0 notation.

This is because we don't have a simple hashmap, what we have is a dynamically resizing hashmap, which we don't set a capacity on, and the resize operations are order n. Now, how often can we expect to resize? I think most hashmaps double their capacity, so there will be log(n) resize operations that each take O(m) (where m is the size of the hashmap when it needs to be resized).

Now at this point, I can't be bothered to think further about it. That feels like it should be less than O(n * log(n)), but it's kinda close. Either way, it's definitely larger than the simpler O(n) case we're thinking about.

Dynamic resizing a hash table by doubling its size when it reaches a threshold capacity is amortised O(1) per insert.

If there are deletions, both operations are amortised O(1) provided they have been sensible and made the deletion threshold a lower factor.

It's the same sort of pattern as dynamically resizing a growing string. As long as you resize by a constant > 1 factor, rather than an addition, the number of resizes is low enough that the O(n) cost of copying is amortised over prior appends.

> Either way, it's definitely larger than the simpler O(n) case we're thinking about.

It actually isn't larger than O(n), setting aside constant factors.

For a hash map's time to grow more than O(n) for n insertions starting from empty, suggests an implementation problem.

I should have thought about things more. I think this is a case of confirmation bias on my part.

I thought "what about dynamically resizing" and quickly googled complexity of rehashing a hash which confirmed my thought that it was n.

I guess I didn't think that since we must have inserted n values in at this point, we could just say "insertion is O(1)" and the constant factor would be "performing two hashes" if we are going to a point where it's resizing, i.e. pay the cost of hashing once and rehashing once.

that feels like it makes sense. I'm being a bit hand-wavy as I don't want to sit down with pen and paper.

I retract my incorrect statement above. (I can no longer edit it to add postscript retraction.)

Sure. But, in practice it’s never a debate between O(1) vs. a O(large n). It’s always either a heavy O(1) vs. a light O(small n) —-hash vs. linear search. Or, a heavy O(large n) vs. a light O(large n log n) —-hash everything vs. sort everything. In both cases, the second option usually wins.
Spitballing here, but I’d suspect a “hash table” implemented as a vector of tuples of key/value pairs with linear time search for the keys would outperform a canonical hashtable for simple JRPC APIs where the vast majority of your JSON objects have fewer than a dozen elements.

Would be interesting to build a hashtable with an equivalent “small string optimization” approach where you have a fixed sized array that is replaced by a canonical hashtable once it grows past some threshold.

Eh, the optimization you talk about is basically how hash tables are implemented in languages like Ruby and Crystal.
It's not only about the constant factor, but also about the fact that O is only valid toward infinity, and potentially about how computer actually work (e.g. the extreme discrepancy between the latency of L1 vs RAM)