| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by kootenpv 2383 days ago
	Spot on (I briefly touch on this in the article)! This is why I try to work with cheap-to-compute features. I used to calculate how unique all values were, but ended up taking a sample instead to speed that part up for large data!

1 comments

md_ 2383 days ago

How does that compare to just doing compression on a sample of data? :)

link

kootenpv 2382 days ago

Perfect interview questions to get to the juice details haha!

The problem is that choice of compression is very much dependent on the sample size, so this is why just choosing the algorithm based on running benchmarks on the sample will be off.

However, computing certain statistics on the sample and then doing machine learning makes a lot of sense!

Obviously not for simple/cheap to compute features such as number of rows/columns which actually do not take longer to compute as the data gets larger.

But I actually do this for predicting the uniqueness of values! You basically want to get an idea of how many unique values you have per column. But.. if you take a sample this will give you completely wrong numbers.

You can see my approach here: https://github.com/kootenpv/shrynk/blob/6a8675061d82aa65fc3b... Basically the formula calculates the uniqueness on a sample (e.g. 10000 rows), but then extrapolates the result to your actual data size. E.g. it finds 100 unique values in a sample of 10000, but in reality that means you maybe have 500 unique values.

link

ma2rten 2383 days ago

Running the compression algorithms is O(kn) where k is the number of compression algorithms. Taking the machine learning approach is O(n).

link

md_ 2382 days ago

As the author noted after you posted this, it’s not a given that the ML algorithm is O(n). It may be constant time (by looking only at column headers and a sample of data, say).

That said, I was really more interested in practical runtimes. Like, in practice the ML may have a high startup cost (e.g., due to cost of loading a model), whereas for most sized datasets linear complexity may be fine...

link

kootenpv 2381 days ago

Models are cached and not large so the setup time is very low. I'll time it when I get to a PC

link