Hacker News new | ask | show | jobs
by 0cf8612b2e1e 1065 days ago
This looks very interesting. Having not read the paper or the author’s library, I am curious how well it scales to more features. I would naively assume it is doing an exponential number of column comparisons.

Keeping in mind the strong correlation between divorces and margarine (https://tylervigen.com/spurious-correlations) I am still tempted to wire this up to automatically generate reports when data falls outside of a certain threshold.

1 comments

I read the paper. It’s fundamentally I think an exponential problem but they apply some constraints to reduce it: only considering tuples of size n <= 3, pruning.

Personally I think the case where you’re doing this on one column is not that interesting: for each column, get all values with >= support frequency, group by it on old and new, include in results is risk_ratio over threshold. List results in order of risk_ratio. Going from that to just two columns is much much more computationally demanding and where pruning and such really matters.

The article was giving me a market basket analysis vibe(maybe just the shared terminology), so possibly a similar underlying algorithm to do the pruning.