Hacker News new | ask | show | jobs
by titzer 2951 days ago
Clickbaity paper title. That raised my hackles a bit.

This is basically bucket sort with machine learning thrown in. It is not O(N). O(N) has a very specific mathematical meaning, do not abuse it. It means there exists some fixed K that for every N and every input of size N, the running time is <= K*N. This algorithm does not have that property, so it is not O(N). It might be expected linear running time, but it is not O(N).

From the conclusion: "For some distributions with less smoothness, it might be difficult to learn the distribution, and if the training fails...the algorithm might fail."

Grr.

2 comments

Also, they end up with a O(N*M) algorithm. They claim M is small, but it needs to be strictly smaller than logN if they want to be better than O(NlogN). A terabyte is 10^12. Thus log(N) is 12. They have 100 hidden neurons which is their M. So their algorithm isn't asymptotically faster, unless they go in N>10^100 with that amount of hidden neurons.
For 32bit ints that in the order of 3.6e88 TBs.... To put that into perspective the visible universe contains an estimated of 10^82 atoms at most, so if a single atom contains 1 petabyte of data, and you want to sort all atoms and all or their data - this ML approach would be faster. I wish they had that discussion in the paper.
> A terabyte is 10^12.

It's 2^40, and log2(2^40) is 40.

While i like the novel approach to see if it's possible, I miss some discussion that relates them to other algorithms for known scenarios. Sorting is embarrassingly easy to parallelize. Basically you can Divide'n'Conquer with as many machines as needed with only the networking and storage interface as the limiting factor. Could it be cheaper when using a GPU and ML? Probably, but when the running time is unpredictable and highly dependent on the training and you cannot even guarantee correctness, is it really worth it?

I'd be more interested in seeing the analysis in terms of IO operations; because the CPU will be plentiful when the data set grows beyond what is possible to juggle in the memory.