Hacker News new | ask | show | jobs
by Retr0id 85 days ago
Super cool!

A related situation I was in recently was where I was trying to bisect a perf regression, but the benchmarks themselves were quite noisy, making it hard to tell whether I was looking at a "good" vs "bad" commit without repeated trials (in practice I just did repeats).

I could pick a threshold and use bayesect as described, but that involves throwing away information. How hard would it be to generalize this to let me plug in a raw benchmark score at each step?

3 comments

I have this same issue a lot.

I vibe up a lot of really simple casual games, which should have very minimal demands, and the LLM-agent introduces bad things a lot that don't present right away. Either it takes multiple bad things to notice, or it doesn't really affect anything on a dev machine but is horrible on wasm+mobile builds, or I just don't notice right away.

This is all really hard to track down, there's noise in the heuristics, and I don't know if I'm looking for one really dumb thing or a bunch of small things that have happened over time.

This is a real pain point. One thing that helps: when an LLM agent makes changes across multiple commits, look at what it actually touched structurally. Often the agent adds a feature in commit 5 but subtly breaks something in commit 3 by changing a shared function it didn't fully understand.
At a guess, you can reuse the entropy part, but you'd need to plug in a new probability distribution.
I don't yet know a better way to do this than using a threshold!

I think if you assume perf is normally distributed, you can still get some of the math to work out. But I will need to think more about this... if I ever choose this adventure, I'll post an update on https://github.com/hauntsaninja/git_bayesect/issues/25

(I really enjoy how many generalisations there are of this problem :-) )

Perf is not normally distributed. Or rather, it is very very common for it to not be. Where I work, we often see multimodal (usually mostly bimodal) distributions, and we'll get performance alerts that when you look at them are purely a result of more samples happening to shift from one mode to another.

It's easy to construct ways that this could happen. Maybe you're running a benchmark that does a garbage collection or three. It's easy for a GC to be a little earlier or a little later, and sneak in or out of the timed portion of the test.

Warm starts vs cold starts can also do this. If you don't tear everything down and flush before beginning a test, you might have some amount of stuff cached.

The law of large numbers says you can still make it normal by running enough times and adding them up (or running each iteration long enough), but (1) that takes much longer and (2) smushed together data is often less actionable. You kind of want to know about fast paths and slow paths and that you're falling off the fast path more often than intended.

As usual you can probably cover your eyes, stick your fingers in your ears, and proceed as if everything were Gaussian. It'll probably work well enough!