Hacker News new | ask | show | jobs
by arnioxux 3005 days ago
> Robust? Yes. It would take virtually forever to run out of space through repeated list reordering.

I think the author missed a pathological case with using the Stern-Brocot tree.

The goal is to sort each item with a score. Then to insert something between two items, you calculate a new score that is "between" their existing scores.

- Option 1) Take the average of the two score a/b and c/d:

  (ad+bc)/(2bd)
- Option 2) Take the mediant(https://en.wikipedia.org/wiki/Mediant_(mathematics) ):

  (a+c)/(b+d)
We don't want to use the first option, the average because almost by definition you would require at least another bit of precision in the denominator for each insert due the multiplication by 2. This limits you to a meager ~64 repeated inserts before overflowing the denominator.

So the mediant is better because the precision requirements is limited by the number of times you can add instead of multiply. This might seem to grow a lot slower at first. And the author even showed that if you're repeatedly averaging 0/1 with the new score, you would get the pattern 1/2, 1/3, 1/4, 1/5, 1/6, .... Which means you can handle up 2^precision number of inserts which is much better than before.

But this only applies to inserting to front (medianting with 0/1) or back (medianting with 1/1) because this will only add a 0 or 1 each time.

If you go down a zigzag path down the middle of the tree instead, you can get the two numbers added to be much closer in magnitude and explode. For example: 1/1, 1/2, 2/3, 3/5, 5/8, 8/13, 13/21, 21/34, ... (it's fibonacci!).

So essentially you get the same thing as the averaging case where the precision required roughly doubles(1.6180339887...) at each step and it will overflow quickly.

This isn't a purely theoretical edge case either. The example from before can be realized by repeatedly inserting between the last two items inserted, which arises naturally if you're repeatedly adding to the middle of a list.

3 comments

As far as I can tell, there's really no advantage to using floating point/fractions at all. I've used an integer-based solution for this very problem in the past.

You can apply the same logic of finding the midpoint between two integers using very simple integer math, store the result as an integer, and get high performance. Just don't start by populating your list with small numbers, the first number would be 0 and the second one would be the midpoint between 0 and either MAXINT or MININT depending on if it came before or after. This ought to have all the advantages of approach 3 (True Fractions) with none of the disadvantages. I'd create a function integer_intermediate just like the rational_intermediate function documented in the article too.

The problem with floating points is that the available precision is highly variable. Sure you can bisect between 0 and 1 many times, but what about between 1000 and 1001? Those 10 bits are lost.

This seems like the best approach on this thread. Decimals and floating points are plain out of place here. An extremely simple bisection between two elements a and b where either is null if at the head or tail of the list would do it:

bisect(a,b) = (coalesce(a, MININT)/2) + (coalesce(b, MAXINT)/2)

gns24 is right, if you're using 64 bits of precision, no matter the format, there must be some combination of at most 64 insertions that trigger the edge case of trying to insert between two items that have no representable value between them. I feel like there's a simple proof with e.g. pigeonhole principle.

So all of these approaches will have to be paired with a normalization routine that takes takes the bisected values and spreads them back out across the space to allow insertion again. Probably an after update/insert trigger that only kicks in when a new item is placed directly next to another, and spreads out all the nearby keys so that there is some minimum distance between them.

Exactly. If we're storing the positional value into 64-bits then there must be some patterns of 64 inserts which don't fit.

The decimal solution is really not bad. You could improve on it by not taking the midpoint between the two values you're inserting between, but rather another point which doesn't require an extra digit, when that's possible, but really that's very likely to be an unnecessary optimisation. So continual insertion on the end might give 0.5, 0.8, 0.9, 0.95, 0.98 0.99, 0.995...

That's a really elegant way to put it. Yay for information theory.
What if, periodically, the orders are recalculated? Like 95% of cases are not going to reorder items back and forth 64 times in a session. If that does happen then take the extra time to reset everything (1,2,3,4...).

And maybe periodically do the same for all values in your table? Like when users are not interacting with the items maybe.

Thank you! I was intuitively sure this was the case (i.e., that you couldn't do markedly better than doubles when working with a fixed size of 64 bits) but I didn't see how to construct the pathological sequence like you did.