by nirinor 1238 days ago

Very nice formalization.

One area for refinement: it considers two stacks either identical or unrelated. But stack A;B is actually very close to A;B;C: the difference might just be due to the sample landing right before or right after the call to C. OP treats them as just as different as A;B and Z;W, thereby amplifying measurement noise.

This suggests using a refined metric between stacks (e.g., an edit distance counting pushes and pops), and then we can use it in defining the metric between flamegraphs (e.g., an optimal transport metric [1], instead of the proposed L1).
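To make that concrete, here is a rough sketch in Python (my own illustration, not OP's code; all function names are made up). It assumes a flamegraph is a dict mapping a stack (tuple of frame names, root first) to a sample count, and that the stack distance is the number of pops plus pushes needed to turn one stack into the other. That distance is a path length in the prefix tree of stacks, and for a tree metric the Wasserstein-1 distance has a known closed form: sum, over every non-empty prefix, the absolute difference in the mass of samples passing through it.

```python
from collections import defaultdict


def stack_distance(a, b):
    """Pops plus pushes needed to turn stack a into stack b."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def normalize(counts):
    """Turn raw sample counts into a probability distribution over stacks."""
    total = sum(counts.values())
    return {stack: n / total for stack, n in counts.items()}


def l1_distance(p, q):
    """Baseline: every distinct stack is its own unrelated dimension."""
    return sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in set(p) | set(q))


def tree_wasserstein(p, q):
    """Wasserstein-1 between p and q under stack_distance.

    Assumes unit cost per push/pop, so each edge of the prefix tree has
    weight 1 and contributes |mass below it in p - mass below it in q|.
    """
    diff = defaultdict(float)
    for stack, mass in p.items():
        for depth in range(1, len(stack) + 1):
            diff[stack[:depth]] += mass
    for stack, mass in q.items():
        for depth in range(1, len(stack) + 1):
            diff[stack[:depth]] -= mass
    return sum(abs(v) for v in diff.values())


# Two profiles that differ only in whether the sample landed inside C.
p = normalize({("A", "B"): 3, ("A", "B", "C"): 1})
q = normalize({("A", "B"): 1, ("A", "B", "C"): 3})
# A profile of entirely unrelated code.
r = normalize({("Z", "W"): 4})

print(l1_distance(p, q), tree_wasserstein(p, q))  # 1.0 0.5
print(l1_distance(p, r), tree_wasserstein(p, r))  # 2.0 4.25
```

Under L1 the unrelated profile is only twice as far from p as the near-identical one; under the transport metric it is 8.5 times as far, which is closer to what we intuitively mean by "different".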

Avoiding that noise amplification reduces the background noise level, and therefore the cost of effective measurements. From another perspective, the current OP scheme creates an avoidable curse of dimensionality in the form of the Hotelling test's requirement to have more samples than distinct stack frames. So the same code split into more functions is harder to measure, and too-small samples are useless. I think neither of those is necessary if we take stack similarity into account.

[1] https://en.wikipedia.org/wiki/Wasserstein_metric