| HN Mirror


by pizlonator 443 days ago

The worst part about this:

> Running experiments until you get a hit

Is that it's literally what us software optimization engineers do. We keep writing optimizations until we find one that is a statistically significant speed-up.

Hence we are running experiments until we get a hit.

The only defense I know against this is to have a good perf CI. If your patch seemed like a speed-up before committing, but perf CI doesn't see the speed-up, then you just p-hacked yourself. But that's not even foolproof.

You just have to accept that statistics lie and that you will fool yourself. Prepare accordingly.

6 comments

starspangled 443 days ago

> Is that it's literally what us software optimization engineers do. We keep writing optimizations until we find one that is a statistically significant speed-up.

I don't think that is what it is saying. It is saying you would write one particular optimization (your hypothesis), and then you would run the experiment (measuring speed-up) multiple times until you see a good number.

It's fine to keep trying more optimizations and use the ones that have a genuine speedup.

Of course the real world is a lot more nuanced. Oftentimes measuring the performance speed-up involves a hypothesis as well ("Does this change to the allocator improve network packet transmission performance?"). You might find that it does not, and then run the same change on disk IO tests to see if it helps that case. That is presumably okay too if you're careful.

LegionMammal978 443 days ago

"Multiple times" doesn't have to mean "no modifications". Suppose the software is currently on version A. You think that changing it to a version B might make it more performant, so you implement and profile it. You find no difference, so you figure that your B implementation isn't good enough, and write a slight variation B', perhaps moving around some loops or function calls. If that makes no difference, you keep writing variations B'', B''', B'''', etc., until one of them finally comes out faster than version A. You finally declare that version B (when properly implemented) is better than version A, when you've really just tried a lot more samples.

starspangled 443 days ago

Well it does mean "no modifications" to the hypothesis, hypothesis being about performance of code A and B. Code B' would be a change.

It's just semantics, but the point is that the article wasn't saying the same thing OP was worried about. There's nothing wrong with testing B, B', B'', etc. until you find a significant performance improvement. You just wouldn't test B several times and take the last set of data when it looks good. Almost goes without saying really.

LegionMammal978 443 days ago

Sure, it may not be precise repetition, but my idea here is that none of B', B'', etc. are really different than B (they may even compile down to the exact same bytecode), they're just the same thing but written differently. And in fact, none of these are really faster than A, even if they're all "changes". But it's the same issue as any other form of p-hacking, where you keep trying more and more trivial B-variations until you eventually get the result that you're looking for, by random chance. (Cf. the example in xkcd 882, which does change the experimental protocol each time, but only trivially.)

There is, in fact, "something wrong" with this, which is what GP was pointing out. It's literally covered under "Playing with multiple comparisons" in TFA.

(Personally, to combat this, I've ignored the fancy p-values and resorted to the eyeball test of whether it very consistently produces a noticeable speedup.)
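
For a rough sense of how quickly the B', B'', ... scenario goes wrong, here is a minimal Python sketch of the multiple-comparisons arithmetic. It assumes each variant is truly neutral and each measurement is an independent test at the same threshold; the function names are hypothetical, and the Bonferroni correction is one common remedy, not something proposed in the thread.

```python
def family_wise_error_rate(k, alpha=0.05):
    """Chance that at least one of k independent neutral variants
    crosses the significance threshold purely by luck."""
    return 1 - (1 - alpha) ** k


def bonferroni_threshold(k, alpha=0.05):
    """Stricter per-variant threshold that keeps the overall
    false-positive rate at or below alpha across k variants."""
    return alpha / k


if __name__ == "__main__":
    for k in (1, 5, 14, 20, 40):
        print(
            f"{k:>2} variants: "
            f"P(at least one false hit) = {family_wise_error_rate(k):.3f}, "
            f"Bonferroni threshold = {bonferroni_threshold(k):.5f}"
        )
```

Under these assumptions, a false "win" becomes more likely than not once roughly 14 neutral variants have been tried.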

throwanem 443 days ago

Why is this bad for you? You're optimizing software, not trying to describe reality. Monte Carlo and Drunkard's Walk are fine.

analog31 443 days ago

You're churning the user experience for no reason. Maybe constant optimization churn is one of the reasons why UIs are so bad.

throwanem 443 days ago

Perf, though? If a perf optimization changes the UI noticeably other than by making it smoother or otherwise less janky, someone is lying to someone about what "performance" means. Likely though that be, we needn't embarrass ourselves by following the sad example.

No, UIs churn because when they get good and stay that way, PMs start worrying no one will remember what they're for. Cf. 90% of UI changes in iOS since about version 12.

appleaday1 443 days ago

I thought languages such as Rust, and tools like flamegraphs, were supposed to help us avoid doing all this testing and optimization, right? I use the built-in analysis tools that come with cargo, plus what I have on my OS: tools like Cutter or other reverse engineering tools. Even in Python I use the default or standard profiling and optimization tools. I sometimes wonder whether I'm not doing enough; shouldn't the recommended default tools cover most edge cases and performance cases?

pizlonator 443 days ago

Yeah!

And software ultimately fails at perfect composability. So if you add code that purports to be an optimization then that code most likely makes it harder to add other optimizations.

Not to mention bugs. Security bugs, even.

appleaday1 443 days ago

Heck, even AI doesn't start with security by default, judging from the models I have tested. It's really, really weird.

cortesoft 443 days ago

Well, what is the test you are using to measure performance? Maybe the optimizations help performance in some cases and hurt performance in others... your test might not fully match all real world workloads.

jean_lannes 443 days ago

These seem like two different things. Testing many different optimizations is not the same experiment; it's many different experiments. The SE equivalent of the practice being described would be repeatedly benchmarking code without making any changes and reporting results only from the favorable runs.
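
As an illustration of "reporting results only from the favorable runs", here is a small Python simulation sketch. It assumes benchmark timings are independent Gaussian noise around 100 ms with a 5 ms standard deviation (made-up numbers) and that the code under test never changes; all names are hypothetical.

```python
import random
from statistics import fmean


def session_mean(rng, n_runs=20, mean_ms=100.0, sigma=5.0):
    """Mean time of one benchmark session of unchanged code."""
    return fmean([rng.gauss(mean_ms, sigma) for _ in range(n_runs)])


def apparent_speedup_pct(rng, reruns=10):
    """Compare a baseline session against reruns of the *same* code."""
    baseline = session_mean(rng)
    sessions = [session_mean(rng) for _ in range(reruns)]
    # Cherry-picking: report only the fastest rerun.
    cherry_picked = (baseline - min(sessions)) / baseline * 100
    # Honest alternative: report the average over every rerun.
    pooled = (baseline - fmean(sessions)) / baseline * 100
    return cherry_picked, pooled


def average_apparent_speedup(trials=500, seed=7):
    rng = random.Random(seed)
    results = [apparent_speedup_pct(rng) for _ in range(trials)]
    return fmean(r[0] for r in results), fmean(r[1] for r in results)


if __name__ == "__main__":
    cherry, pooled = average_apparent_speedup()
    print(f"cherry-picked 'speedup': {cherry:.2f}%")
    print(f"pooled 'speedup':        {pooled:.2f}%")
```

With these assumed noise levels, picking the best of ten reruns manufactures a speedup on the order of a percent or two out of nothing, while the pooled figure stays close to zero.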

pizlonator 443 days ago

Doesn’t matter if it’s the same experiment or not.

Say I’m after p<0.05. That means that if I try 40 different purported optimizations that are all actually neutral duds, one of them will seem like a speedup and one of them will seem like a slowdown, on average.
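
The arithmetic here can be checked with a short Python simulation. This is a sketch under stated assumptions: every "optimization" is truly neutral, timings are Gaussian noise with a known standard deviation (100 ms mean and 5 ms sigma are made-up numbers), and the test is a two-sided z-test at p<0.05. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean


def z_and_p(baseline, patched, sigma):
    """Two-sided z-test for equal-sized samples with known noise sigma."""
    n = len(baseline)
    z = (fmean(patched) - fmean(baseline)) / (sigma * math.sqrt(2 / n))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


def false_hits(rng, n_patches=40, n_runs=20, mean_ms=100.0, sigma=5.0, alpha=0.05):
    """Count neutral patches that look like speedups or slowdowns."""
    speedups = slowdowns = 0
    for _ in range(n_patches):
        # Both samples come from the same distribution: the patch does nothing.
        baseline = [rng.gauss(mean_ms, sigma) for _ in range(n_runs)]
        patched = [rng.gauss(mean_ms, sigma) for _ in range(n_runs)]
        z, p = z_and_p(baseline, patched, sigma)
        if p < alpha:
            if z < 0:  # patched looks faster (lower time)
                speedups += 1
            else:
                slowdowns += 1
    return speedups, slowdowns


def average_false_hits(batches=200, seed=12345):
    """Average the counts over many batches of 40 neutral patches."""
    rng = random.Random(seed)
    total_speedups = total_slowdowns = 0
    for _ in range(batches):
        s, d = false_hits(rng)
        total_speedups += s
        total_slowdowns += d
    return total_speedups / batches, total_slowdowns / batches


if __name__ == "__main__":
    avg_speedups, avg_slowdowns = average_false_hits()
    print(f"avg false speedups per 40 patches:  {avg_speedups:.2f}")
    print(f"avg false slowdowns per 40 patches: {avg_slowdowns:.2f}")
```

Since 40 × 0.05 = 2 false positives are expected and a two-sided test splits them evenly, both averages should land near one.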

daveFNbuck 443 days ago

That's not p hacking. That's just the nature of p values. P hacking is when you do things to make a particular experiment more likely to show as a success.

bbertelsen 443 days ago

There's another cheeky example of this where you select a pseudo-random seed that makes your result significant. I have a personal seed, I use it in every piece of research that uses random number generation. It keeps me honest!
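
A minimal sketch of the fixed-seed habit in Python, assuming the randomized step is a bootstrap confidence interval over benchmark timings. The seed value, function name, and sample data are all hypothetical; the point is only that the seed is chosen once, before looking at results, and never tuned.

```python
import random
from statistics import fmean

# Hypothetical personal seed: chosen once, reused everywhere, never tuned.
PERSONAL_SEED = 8675309


def bootstrap_ci(samples, n_resamples=2000, confidence=0.95, seed=PERSONAL_SEED):
    """Percentile bootstrap confidence interval for the mean of samples."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    means = sorted(
        fmean(rng.choices(samples, k=len(samples))) for _ in range(n_resamples)
    )
    tail = (1 - confidence) / 2
    low = means[int(tail * n_resamples)]
    high = means[int((1 - tail) * n_resamples) - 1]
    return low, high


if __name__ == "__main__":
    timings_ms = [98.0, 101.5, 99.2, 100.8, 97.6, 102.3, 100.1, 99.7]
    # Same data and same seed always give the same interval.
    print(bootstrap_ci(timings_ms))
```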

doubletwoyou 443 days ago

what they’re referring to might be better put as applying a patch once and then running it 500 times until you get a benchmark that's better than baseline for some reason

which is understandably a bit more loony

pizlonator 443 days ago

Nah it could be 20 different patches.

appleaday1 443 days ago

How can I do this in Python? What modules?
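
One possible stdlib-only starting point, assuming "this" means timing two versions of some code and checking whether the difference is more than noise. It uses `timeit`, `random`, and `statistics` from the standard library; `version_a`, `version_b`, and `permutation_p_value` are hypothetical names, and the permutation test is just one reasonable choice of test.

```python
import random
import timeit
from statistics import fmean


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of means."""
    rng = random.Random(seed)
    observed = abs(fmean(a) - fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        # Shuffle the labels: under "no difference" any split is equally likely.
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:len(a)]) - fmean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # The +1 counts the observed split itself, so p is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def version_a():
    """Placeholder workload: builds a list, then sums it."""
    return sum([i * i for i in range(1000)])


def version_b():
    """Placeholder workload: sums a generator directly."""
    return sum(i * i for i in range(1000))


if __name__ == "__main__":
    times_a = timeit.repeat(version_a, repeat=15, number=200)
    times_b = timeit.repeat(version_b, repeat=15, number=200)
    print(f"mean A: {fmean(times_a):.6f}s  mean B: {fmean(times_b):.6f}s")
    print(f"permutation p-value: {permutation_p_value(times_a, times_b):.4f}")
```

Back-to-back timing runs on a real machine are not fully independent (caches, frequency scaling, background load), so treat the p-value as a rough guide. And per the rest of the thread, decide how many variants you will test before you start.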