|
The author of this blog didn’t handle minimality or max_order > 1, which basically sidesteps everything DIFF is good at: you can just brute-force across all column values with support over the threshold. That turns an exponential or polynomial problem (in the cardinality of columns and the number of columns) into one that is linear in the number of column values across columns.

But even DIFF has a problem. Take the example: because I updated my application from v2 to v3, the initial set has 90% of (Galaxy, 11.0) records on v2 and 4% on v3. After the update, v2 is at 10% and v3 is at 80% of (Galaxy, 11.0) records. In both sets, (Galaxy, 11.0) represents 80% of all records. Let’s say the total crash rate increased 25% and I call DIFF on (version, client, os). DIFF highlights the (v3, Galaxy, 11.0) slice first, even if v3 crashes at the same rate v2 did, because there was an organic shift in the proportions of v2 and v3 records: the contribution is 20x and support is 75%. But actually, some combo in the slices outside of (Galaxy, 11.0) contributed most of the diff, and it gets ranked second, with less contribution and less support. The “explanation” that v3 introduced a bug is wrong: we simply caught that v3 replaced most usage of v2.

DIFF does roll-ups, but it applies a minimality and risk ratio threshold to results, so the fact that the (Galaxy, 11.0) crash rate across v2 and v3 didn’t actually change may even hide the true story from the result. But the minimality threshold is what makes their implementation performant. If you disable minimality and directly compute all roll-ups, you have to group by as many combinations as the power set of the column cardinalities, instead of the product of their cardinalities, which on k columns with cardinality ~n takes you from O(n^k) to O(2^(n^k)) (I think, please check the math).
Note that in the experimental run with minimality disabled, they performed worse than their comparison software, and that most of the datasets come with no mention of cardinalities. But with minimality enabled, IIUC you don’t catch a “shift within X^Y from X^Y^Z to X^Y^~Z that doesn’t explain the overall shift because the XY risk_ratio is 0”, which is a problem.

DIFF also mitigates the runtime issue by limiting the total number of columns combined to produce an answer. It’s a nice way to prevent long runtimes, but it imposes additional constraints on real-world utility: how often is it just version+client+os that’s the problem and not version+client+os+uses_feature+migrated_from_legacy_table?

My post is getting too long, but many state-of-the-art experimentation tools work around this by controlling for distribution shifts between sets. That doesn’t work well with a table of just crashes, but it works fine once you have a denominator like “usage time” |
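
To make the denominator point concrete, here is a rough sketch with made-up numbers that match the (Galaxy, 11.0) example above. The crash counts, the usage hours, and the helper names `share_ratio` and `rate_ratio` are all my own assumptions for illustration, not anything DIFF computes under those names. It only shows that a version's share of crash records can jump while crashes per usage hour stay flat.

```python
from fractions import Fraction

# Hypothetical crash-record counts inside the (Galaxy, 11.0) slice.
# Assumption: the slice holds 1000 crash records in both sets, so
# v2 goes from 90% to 10% and v3 goes from 4% to 80%, as in the example.
SLICE_TOTAL = 1000
crashes_before = {"v2": 900, "v3": 40}
crashes_after = {"v2": 100, "v3": 800}

# Hypothetical usage hours, assumed to shift in the same proportion as the
# crash records (i.e. users simply moved from v2 to v3).
hours_before = {"v2": 9000, "v3": 400}
hours_after = {"v2": 1000, "v3": 8000}


def share_ratio(version):
    """After/before ratio of a version's share of crash records."""
    before = Fraction(crashes_before[version], SLICE_TOTAL)
    after = Fraction(crashes_after[version], SLICE_TOTAL)
    return after / before


def rate_ratio(version):
    """After/before ratio of crashes per usage hour for a version."""
    before = Fraction(crashes_before[version], hours_before[version])
    after = Fraction(crashes_after[version], hours_after[version])
    return after / before


if __name__ == "__main__":
    for version in ("v2", "v3"):
        print(version, "share ratio:", share_ratio(version),
              "rate ratio:", rate_ratio(version))
```

With these numbers the v3 share ratio is 20 while the v3 rate ratio is 1, so a crash-only table flags v3 even though nothing about v3 got worse per hour of use.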