| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by CipherThrowaway 1026 days ago
	I did not see it mentioned in the article, but I wonder why or if the team rejected a permutation function on the index. The shuffled dataset would then be Xs[i] = X[f(i)]

1 comments

anonymoushn 1026 days ago

Do you have a cheap permutation function for large n? It seems like you still have to do it in two passes if you do this.

One reason cited in TFA for the half-shuffled files approach is that it's easy to rotate old data out of and new data into the half-shuffled files.

link

CodesInChaos 1026 days ago

There are format preserving encryption algorithms which act as a pseudo random permutation of integers with a variable upper bound. These slower than Fisher-Yates for in memory data, but for on disk data the random access overhead should exceed their cost. The advantage of this approach is minimal memory use and no need to modify the data.

The downside is that it still needs one random read access per element, so cache friendly hierarchical algorithms, like the one described in the post, are probably still faster for on disk data.

link

CipherThrowaway 1026 days ago

You could flip the order and do a sequential read from the source with a random write to the destination. It sounds like this would still suffer from poor performance on the data mentioned in the article, but it is hard to feel like some kind of opportunity hasn't been missed with index permutation.

For example, using the permuted the index mod M rather than random draw in the first pass could avoid the whole issue of oversized piles and the extra wasted space per pile.

link

CipherThrowaway 1026 days ago

The permutation function is essentially a block cipher on the index. It's stateless and there is no need for intermediate files or buffers. The cost of running the function should be comparable to RNG.

link