Hacker News new | ask | show | jobs
by kootenpv 2383 days ago
Spot on (I briefly touch on this in the article)! This is why I try to work with cheap-to-compute features. I used to calculate how unique all values were, but ended up taking a sample instead to speed that part up for large data!
1 comments

How does that compare to just doing compression on a sample of data? :)
Perfect interview questions to get to the juice details haha!

The problem is that choice of compression is very much dependent on the sample size, so this is why just choosing the algorithm based on running benchmarks on the sample will be off.

However, computing certain statistics on the sample and then doing machine learning makes a lot of sense!

Obviously not for simple/cheap to compute features such as number of rows/columns which actually do not take longer to compute as the data gets larger.

But I actually do this for predicting the uniqueness of values! You basically want to get an idea of how many unique values you have per column. But.. if you take a sample this will give you completely wrong numbers.

You can see my approach here: https://github.com/kootenpv/shrynk/blob/6a8675061d82aa65fc3b... Basically the formula calculates the uniqueness on a sample (e.g. 10000 rows), but then extrapolates the result to your actual data size. E.g. it finds 100 unique values in a sample of 10000, but in reality that means you maybe have 500 unique values.

Running the compression algorithms is O(kn) where k is the number of compression algorithms. Taking the machine learning approach is O(n).
As the author noted after you posted this, it’s not a given that the ML algorithm is O(n). It may be constant time (by looking only at column headers and a sample of data, say).

That said, I was really more interested in practical runtimes. Like, in practice the ML may have a high startup cost (e.g., due to cost of loading a model), whereas for most sized datasets linear complexity may be fine...

Models are cached and not large so the setup time is very low. I'll time it when I get to a PC