Hacker News new | ask | show | jobs
by budabudimir 2795 days ago
I don't think a ordered data structure with O(1) insert time is even possible, as that would mean that you can do sort in O(n).
4 comments

> with O(1) expected restructuring time and exponentially infrequent occurrences of expensive restructuring

So I was curious about this, so I threw together a quick and dirty implementation of the insert function as described in the paper, counting both the number of times the insert() function is called and the number of times a pointer is written to when inserting 65535 (2^16 - 1) random numbers:

Below are the results:

    Inserts      0 to      1: Avg fn calls:   2.00, Avg ptr writes:   2.00 
    Inserts      1 to      3: Avg fn calls:   3.50, Avg ptr writes:   1.00 
    Inserts      3 to      7: Avg fn calls:   5.00, Avg ptr writes:   4.00 
    Inserts      7 to     15: Avg fn calls:   6.12, Avg ptr writes:   5.00 
    Inserts     15 to     31: Avg fn calls:   7.06, Avg ptr writes:   4.62 
    Inserts     31 to     63: Avg fn calls:   7.28, Avg ptr writes:   3.59 
    Inserts     63 to    127: Avg fn calls:   9.64, Avg ptr writes:   4.81 
    Inserts    127 to    255: Avg fn calls:  11.72, Avg ptr writes:   4.08 
    Inserts    255 to    511: Avg fn calls:  13.73, Avg ptr writes:   4.73 
    Inserts    511 to   1023: Avg fn calls:  17.61, Avg ptr writes:   5.30 
    Inserts   1023 to   2047: Avg fn calls:  18.95, Avg ptr writes:   4.99 
    Inserts   2047 to   4095: Avg fn calls:  19.12, Avg ptr writes:   5.05 
    Inserts   4095 to   8191: Avg fn calls:  19.67, Avg ptr writes:   4.98 
    Inserts   8191 to  16383: Avg fn calls:  20.76, Avg ptr writes:   4.94 
    Inserts  16383 to  32767: Avg fn calls:  22.38, Avg ptr writes:   4.99 
    Inserts  32767 to  65535: Avg fn calls:  24.18, Avg ptr writes:   4.99 
So it turns out that it is O(1) in terms of number of pointer writes but O(lg(n)) in terms of number of function calls, and since every function call will perform at least one comparison, that means this algorithm is O(1) for writes but O(lg(n)) for reads, meaning that given large enough n, the O(lg(n)) read term will dominate and thus this algorithm is O(lg(n)) for inserts.

That said, if writes are far more expensive than reads in the context this is used, this could still be useful.

Did you implement the recursive or the iterative version? Perhaps there isn't any difference in practice between them in this sense, though. I mean, the amount of function calls would differ but perhaps not the comparisons.
Tail call optimization in functional programming languages comes to mind.
Ignore all the other noise about expected vs amortized vs just O(1). It's not what's relevant here.

1. Zip trees do not have O(1) insert/remove. They have O(log n) insert/remove (because they have to search the tree to find where to insert/remove), with O(1) additional restructuring time.

2. You can have an ordered data structure with O(1) insert, but not O(1) insert and O(1) find-min simultaneously. You can have O(1) insert trivially by just delaying all actual inserts by putting them into an array, and only actually insert them later when a search happens.

Radix trees and radix sort can do that, though the constant is the key size.
O(1) expected is a different thing than standard O(1).
Indeed, could you expand on that a bit?

Even if what you meant is expected time complexity for the single insert operation is O(1), that surely cannot be the case. What would be expected time complexity of N inserts then?

Consider the humble array-list- a simpler example to be sure, but it's great for explaining.

Build an array of an initial basic size (16 let's say), and keep a counter of the current size. When you insert items 0 through 15, the time to insert is O(1)- find the current size, insert there. When you insert item 16 however, you need to make a new array of size (currentSize2), move the existing items over to it, then add your new item- which is O(n).

Let's say we insert n items (be in 64, 1024 or 2^32, it doesn't matter). What was the mean, median and mode for an insert?

Well, for any non-doubling case (1-k/k) it took O(1). For any doubling case (1/k of the time) it took O(k). This all adds up to a mode of O(1), a median of O(1) and a mean of (O(1)(k-1) + O(k)*1)/k, or roughly O(2), which is O(1).

Your worst case scenario of an insert is indeed O(n). But that's not the most common outcome.

Similar, from how I'm reading this paper, very rarely you'll have an O(ugly) restructuring during inserts or deletes, but the rest of the time you can expect O(1) while maintaining sorted order. I'm going to have to take a try at implementing it to see how it goes.

That's true for array, that expected insert time is O(1), and it still holds for N inserts, because you double the size every time, which yields O(2N) time complexity. This is not the case for this data structure, time complexity of inserting N elements is O( N log N ), therefore expected complexity of inserting single element cannot be O(1), even if single insert sometimes can be O(1),

O(1) in the paper referred to restructuring, after O( log n ) search time.

We could be violently agreeing, not sure :)

> you'll have an O(ugly) restructuring during inserts or deletes, but the rest of the time you can expect O(1) while maintaining sorted order.

It helps to understand why ugly restructuring is rare ... it's because the rank distribution is such that 1/2 of the time the height of the inserted node is 0 (a leaf), 1/4 of the time the height is 1, 1/8 of the time it is 2, etc. The ugliness of restructuring increases as the height increases, which means that ugliness becomes exponentially rarer. But it still takes O(lg n) to locate the insertion point -- remember, half the time the node is a leaf. And this is a binary search tree, so of course it is O(lg n). If insertion time were O(1) then building the tree would be O(n) which would mean an O(n) sort. That's simply not possible.

So, I think that's what is meant when they say:

with O(1) expected restructuring time and exponentially infrequent occurrences of expensive restructuring.

Occasionally there's O(n) expensive restructuring, but given the nature of how it's increasing space, when and why, and when it will need to happen again, it's twice (or whatever) the prior amount of time before it needs to happen again. I'm not up on my asymptotic notations[1] besides big-O, but perhaps this is more succinctly (if more confusingly, to a general audience) by one of the additional notations?

1: https://en.wikipedia.org/wiki/Big_O_notation#Related_asympto...

> given the nature of how it's increasing space, when and why, and when it will need to happen again

This is really rather straightforward. The whole point of the algorithm is to randomly distribute the height of the inserted node such that half the nodes are leaves (height 0), 1/4 have height 1, 1/8 have height 2, etc. Restructuring is more expensive the larger the height, but the frequency of restructuring decreases exponentially with respect to the cost of doing it. Net result: O(1) for restructuring, + O(lg n) to find the insertion point (as is necessarily true for any binary search tree, else we would have a way to do an O(n) sort).

Can you expand on that please?
They might mean best-case vs average-case vs worst-case. A worst-case hash table insertion might be O(n) (rehashing the table) and in the malicious case it might even be O(n^2) (all table entries collide in a single linked list).
mabbo probably means that the average insert time is O(1), but s/he's wrong and couldn't possibly be right. An O(1) average insertion time into a binary search (i.e., ordered) tree would mean O(n) to build the tree from n elements, which would mean an O(n) search ... no can do. And the paper clearly says that the O(1) restructuring is in addition to the O(lg n) time to find the insertion point ... which anyone familiar with BST algorithms would expect.