|
|
|
|
|
by 0cf8612b2e1e
1065 days ago
|
|
This looks very interesting. Having not read the paper or the author’s library, I am curious how well it scales to more features. I would naively assume it is doing an exponential number of column comparisons. Keeping in mind the strong correlation between divorces and margarine (https://tylervigen.com/spurious-correlations) I am still tempted to wire this up to automatically generate reports when data falls outside of a certain threshold. |
|
Personally I think the case where you’re doing this on one column is not that interesting: for each column, get all values with >= support frequency, group by it on old and new, include in results is risk_ratio over threshold. List results in order of risk_ratio. Going from that to just two columns is much much more computationally demanding and where pruning and such really matters.