Hacker News new | ask | show | jobs
by dkbrk 1370 days ago
Honestly, the inverse Burrows-Wheeler transform seems like some sort of Voodoo black magic to me.

It reminds be of the 100 prisoners problem [0]. Yes, I understand why it works. I can see how it works mathematically. But it still feels like it shouldn't work, that we're somehow getting something for free.

[0]: https://en.wikipedia.org/wiki/100_prisoners_problem

6 comments

Explained here with Go code:

https://bugfix-66.com/7f0f425d3eee8def16e1102d054fd45394d027...

Wikipedia gives an impractical, inefficient account of the algorithm. It's almost comically bad.

But the BWT has an efficient inverse transform with very short, simple code, as seen above. This is the classic linked list algorithm.

The forward transform is also simple if done right. See the Go code here:

https://bugfix-66.com/72949624ebbb28b3d0ce5d700970a8857d354b...

Together, that's full code for industrial-strength forward and inverse BWT. All that's missing is some method of avoiding degeneracy in the forward transform string sort (e.g., noising long runs in the input or a suffix array).

It's frustrating that your link to an "explanation" is an intentionally buggy implementation of the algorithm :(

I understand the point of the debugging game, but in this context, a link to the solutions page would be more helpful....

Replace

  at = root
with

  at = next[root]
and you're done. Once you understand the forward transform, you understand that root is the LAST byte and next[root] is the FIRST byte.

All the fixes in BUGFIX-66 are trivial if you understand what the code is doing, that's part of the game concept. Also, if you submit broken code, it gives you a useful hint to direct your attention to the bug.

Similarly, the fix for the forward transform is to add

  i1, i2 = x[i1], x[i2]
at the top of the less() function.
> Wikipedia gives an impractical, inefficient account of the algorithm. It's almost comically bad.

I encourage you to edit and fix it.

I love your site! Hope you don't mind that I sent you an email.

It's a fantastic example of minimal yet fully functional!

Thanks. I just recently started building it.

Understanding small, elegant algorithms (enough to make tiny fixes) is something I consider fun. Hopefully it's fun for others.

Imagine the book Hacker's Delight, but turned into a simple debugging game. That's the concept.

> that we're somehow getting something for free.

The splay tree also falls squarely into this category for me. The core bit:

> [...] splay trees can take better than logarithmic time, without requiring advance knowledge of the pattern. According to the unproven dynamic optimality conjecture, their performance on all access patterns is within a constant factor of the best possible performance that could be achieved by any other self-adjusting binary search tree, even one selected to fit that pattern.

I've been iterating a kind of append-only log + database system using this as a technique to optimize I/O patterns. Many forms of storage are very happy when you feed them sequential bytes and prefer reads from more-recently-written physical blocks.

Having read through the (very good!) Wikipedia explanations a few times, I'm left with a different troubling intuition: if this works to improve compression, then I feel like there must be something wrong with the compression algorithms!
General-purpose compression algorithms tend to be conceptually simple. They must be fast, so they can't try many different approaches to see what works best. And because they are general-purpose, they can't assume much about the data.

The typical data compression algorithm starts with a combinatorial transformation of the data. Then it creates a statistical model of the transformed data and encodes it according to the model. There are two successful families of combinatorial transformations: Lempel–Ziv and Burrows–Wheeler. Neither of them is superior to the other, but the compression performance depends on the properties of the data.

Lempel–Ziv parsing replaces copies of repeated substrings with references to an earlier copy. It works best when there are long repeated substrings in the data. The modeling/encoding parts of an LZ compressor are usually pretty simple.

Burrows-Wheeler transform reorders the symbols according to the context they occur in. It's often but not always followed by run-length encoding. Because the symbols are reordered by context, BWT-based compressors work best when there are many copies of each repeated substring.

Unlike the LZ, the BWT does not compress the data on its own. It relies entirely on statistical compression. Because the symbols are ordered by context, the BWT is an order-n model for every n simultaneously. Hence you get something similar to PPM by simply partitioning the BWT into blocks and encoding each block independently with an order-0 encoder.

Not necessarily. After the transform, there are more runs of characters, but it could be that this comes at the expense of recognizable patters (like common words) in the input. It likely helps simple compression algorithms more than sophisticated ones.
I clicked this thing on the front page just to leave this precise remark (minus reference to the 100 prisoners). As you already did that, it seems my only sensible option is to voice my agreement.
Same here. m2^2 ;) toying around with these kind of indices to detect / find repetitions for years already.

Btw: what's the best source for these kind of one of a kind, contra-intuitive algos and datastructures? Be it online, be it books...

BWT can't be the only one, right?

The output string implies extra information: the sorted string, ie. two adjacent rotations, ie. how to chain the letters back together. Knowing BW was applied matters!

For the 100 prisoners problem, the intuition corresponds to n-nested verb tenses, which don’t exist in any language AFAIK.

I don‘t understand why there is a difference between following any algorithm for picking drawers or picking them at random. Isn’t the chaining algorithm they describe just another form of random order (with the warden being the random number generator)? Is there a simple explanation?
If you randomize a list of numbers there will always be cycles like described (similar to how there will always be cycles if you would draw lots for Secret Santa and then reveal who drew who), and these cycles have a certain length. It might be one cycle of length 100, but more commonly there will be lots of shorter cycles. If you start with your own number, you are guaranteed to be in a cycle that will return back to you, you just don't know how long that cycle will be (could be 1, could be 100, more likely something between). And apparently you can calculate that in ~31% of cases all cycles will have a length of 50 or less. This is just a neat way to exploit a kind of structure that there always will be within the randomness.
The key insight is that this strategy causes there to be a strong correlation between the success of different prisoners.

As an example, let's say prisoner #10 opens his box and finds #21, then finds #5 in that box, then #84, then #51, then finally succeeds and finds his number 10 in box #51. These boxes form a cycle 10-21-5-84-51 (and then back to 10). Anyone who opens any of these boxes will eventually see the same set numbers that #10 did, so that means that we know prisoners #5, #10, #21, #51 and #84 will all succeed in finding their number by starting at their own box.

Compare that to the situation where they just randomly look in boxes and each independently have only a 50% chance to succeed - then the odds that those 5 prisoners all succeed would be (1/2)^5 = 1/32, or only about 3%. In every cycle containing n prisoners, instead of their success rate being (1/2)^n, now they all succeed together or all fail together.

Now that the success of every prisoner in the same cycle is perfectly correlated with each other, the prisoners' overall chance of success depends only on whether the random permutation created by the warden has any cycles >50 in length. If so, then everyone in that cycle will fail. If not, then all the cycles across the 100 boxes are short enough that all 100 prisoners will succeed.

It doesn't affect the chance of any one prisoner finding their number - that's still 50%. But the joint strategy massively reduces the independence of the individual prisoners' success, and hence the distribution of the number of successful prisoners.

Rather than a normal distribution (sum of independent outcomes) with a big peak around 50% of the prisoners finding their number and a vanishingly small chance of them all (or none) finding the right number, you end up with a much more complex pattern - there's an approximately uniform distribution in the 0-50 successes range with ~70% of the overall outcome, and a huge peak at the 100 successes point with ~30%.

https://www.r-bloggers.com/2014/08/update-100-prisoners-100-... has a bunch of distribution plots showing this.

No. That rules means you are definitely on the right track.

Think of it like having to search for someone in a forest with only enough time to search half the forest.

Would you want to search random spots or be dropped randomly into the trail to the person?

You might yet run out of time, but on average it'll be much more efficient. Surprisingly so in the 100 boxes context.

Veritasium has an excellent video on it but I'm on mobile and can't dig out a link. I'd recommend it though.