by confluence_perf 1992 days ago
"High variance as a consequence of poor software" is an interesting point and not one I'd considered -> I will take this to engineering and see if we can do anything about that (some components maybe, but we see high network variances too which seem unlikely to be fixable).

Showing that we're making progress isn't as much of a problem - similar to what you stated, the fixes themselves target large enough value that it's measurable at volume for sure, and even in testing.

The main issue is "degradations" -> catching any check-ins that can degrade performance. These are usually small individually (let's say low double-digit ms, within the variance noise), but they add up over time, and by the time the degradation is really measurable, it's complicated to track down the root cause. Hopefully I described that in a way that makes sense?

Any suggestions welcome.

(Edit: downvoted too much and replies are throttled again.) @lostdog Thanks for the detail! Will definitely take this to the eng team for a process discussion.

1 comment

I work in an area where high variance is very expected and unavoidable. Here's what we do:

In your PR, you link to the tool showing the performance diff of your PR. The tool shows the absolute and relative differences in performance against the base version of the code. It also tracks the variance of each metric over time, so it can kind of guess which metrics have degraded, though this doesn't work consistently. The tool tries to highlight the likely degraded metrics so the engineer can better understand what went wrong.
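
To make the idea concrete, here is a rough sketch of that kind of diff in Python. It is not the actual tool: the function name `perf_diff`, the data layout, and the "more than `k` standard deviations of historical noise" rule are all assumptions for illustration, and it assumes higher values mean worse performance (e.g. latency in ms).

```python
from statistics import mean, stdev


def perf_diff(base, pr, history, k=2.0):
    """Compare PR metrics against the base version (hypothetical helper).

    base, pr: dict of metric name -> list of samples from one run each.
    history:  dict of metric name -> list of past base-run means,
              used only to estimate how noisy the metric normally is.
    k:        how many standard deviations of noise a slowdown must
              exceed before it is highlighted (an assumed heuristic).
    """
    report = {}
    for name in base:
        b = mean(base[name])
        p = mean(pr[name])
        abs_diff = p - b
        rel_diff = abs_diff / b if b else float("nan")

        past = history.get(name, [])
        # Need at least two past points to estimate the noise level.
        noise = stdev(past) if len(past) >= 2 else None

        report[name] = {
            "abs_diff": abs_diff,
            "rel_diff": rel_diff,
            "noise": noise,
            # Only a guess: highlight slowdowns larger than k * noise.
            "likely_degraded": noise is not None and abs_diff > k * noise,
        }
    return report


if __name__ == "__main__":
    base = {"page_load_ms": [100, 102, 98], "api_ms": [50, 52, 48]}
    pr = {"page_load_ms": [130, 128, 132], "api_ms": [51, 50, 52]}
    history = {
        "page_load_ms": [99, 101, 100, 102, 98],  # historically stable
        "api_ms": [40, 60, 50, 45, 55],           # historically noisy
    }
    for metric, row in perf_diff(base, pr, history).items():
        print(metric, row)
```

With the sample data, the stable metric that got 30 ms slower is highlighted, while the 1 ms change on the noisy metric is not. A highlight is a prompt for discussion, not a verdict.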

If the metrics are better, great! Merge it! If they are worse, the key is to discuss them (quickly in Slack), and decide if they are just from the variance, a necessary performance degradation, or a problem in the code. Typically it's straightforward: the degraded metrics are either unrelated to the change or worth looking into.

The key here is not to make the system too rigid. Good code changes must not be slowed down. Performance issues need to be caught. The approvers need to be fast, and to mostly trust the engineers to care enough to notice and fix the issues themselves.

We also check the performance diffs weekly to catch hidden regressions.
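
One possible way to do a weekly check like this, sketched in Python. This is an assumed approach, not a description of our tooling: `hidden_regressions`, the 8-week window, and the 5% threshold are made-up names and numbers. The point is to compare each week against a baseline several weeks back, so that small per-change slowdowns that never stand out week to week still show up once they accumulate.

```python
def hidden_regressions(weekly_means, window=8, threshold=0.05):
    """Flag weeks that drifted too far from an older baseline.

    weekly_means: weekly mean of one metric (e.g. latency in ms),
                  oldest first; higher is assumed to be worse.
    window:       how many weeks back the baseline is taken from.
    threshold:    relative slowdown vs. the baseline that gets flagged.

    Returns a list of (week_index, relative_change) tuples.
    """
    flagged = []
    for i in range(window, len(weekly_means)):
        baseline = weekly_means[i - window]
        change = (weekly_means[i] - baseline) / baseline
        if change > threshold:
            flagged.append((i, change))
    return flagged


if __name__ == "__main__":
    # Synthetic data: the metric gets 1% slower every week.
    weekly = [100 * 1.01 ** week for week in range(10)]

    # Week-over-week, a 1% step never crosses the 5% threshold...
    print(hidden_regressions(weekly, window=1))
    # ...but against a baseline 8 weeks back, the drift is visible.
    print(hidden_regressions(weekly, window=8))
```

The longer window trades speed of detection for sensitivity: it catches slow drift, but the flagged week only tells you that something in the last few weeks regressed, so the per-PR diffs are still needed to narrow down the cause.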

IF YOUR ORGANIZATION DOES NOT VALUE AND REWARD PERFORMANCE IMPROVEMENTS, NONE OF THIS WILL WORK. Your engineers will see the real incentive system, and resist performance improvements. Personally, I don't believe that Atlassian cares at all about performance, otherwise it never would have gotten this bad. Engineers love making things faster, and if they've stopped optimizing performance it's usually because the company discourages it.