Hacker News
by confluence_perf 1989 days ago
Thanks for the link! Yes this is the general guidance we're using too (0.1/1/10s), and one that we're reinforcing at every level of the company. This link does have more detail than I've seen in other places though, so it's an interesting read.

However, I've not seen guidance on whether these should be P90, P95, or P99 measures. We've selected something internally, but obviously the choice among those three 'measurement points' could drastically change the typical user's experience.
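To make the percentile question concrete, here is a minimal sketch of how the same set of latencies can look fine or poor against the 0.1/1/10s thresholds depending on which percentile is measured. The sample data, the `percentile` helper (nearest-rank method), and the bucket labels are all assumptions for illustration, not anything from our internal tooling.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def bucket(seconds):
    """Classify a latency against the 0.1/1/10s thresholds.
    The labels are illustrative assumptions."""
    if seconds <= 0.1:
        return "feels instant"
    if seconds <= 1.0:
        return "noticeable, flow kept"
    if seconds <= 10.0:
        return "slow, attention strained"
    return "attention lost"

# Hypothetical latencies in seconds: 100 requests, mostly fast, with a slow tail.
latencies = [0.08] * 90 + [0.4] * 6 + [1.5] * 3 + [12.0]

for p in (50, 90, 95, 99):
    value = percentile(latencies, p)
    print(f"p{p}: {value:.2f}s ({bucket(value)})")
```

With this made-up distribution, p50 and p90 land in the fastest bucket, p95 in the second, and p99 in the third, so the "measurement point" alone decides whether the target looks met.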

(HN is throttling my replies so apologies for delay)

1 comment

The percentiles are a bit of a combination.

A big part is simply how far you are in your journey of getting good at performance - if your p50 is still garbage, there's not much point in focussing on your p99 measurements. You should be targeting the p99 long term, but focus on the p50/p90 for now.

It's super important to target and make long term decisions around the p99 though, because, e.g., making a 100x improvement is not possible through little iterative changes over 2-3 years. You need a base to work from where that 100x is fundamentally achievable, which requires thinking from first principles and slightly getting out of the typical product mindset.

I also find the typical product mindset tends to result in focussing a lot on the "this quarter/next quarter" goals, but neglecting the "8/12 quarters from now" as a result.

Beyond short term/long term goals, the choice is largely just down to what the product is/does. Even ignoring all current architectural choices, there are some fundamentals where certain things must always be faster/slower. E.g. sync writes will typically be a fair bit slower than reads, but typically occur much less often; complex dynamic queries which can't be pre-optimised require DB scanning, but are also much less common.

For these kinds of tools, where most of the interaction is reads, mostly on predefined queries or predefined queries plus a little extra filtering, and reading/writing on individual resources (i.e. tickets), you can get p99 numbers trending towards the 100ms mark eventually - there's very little which truly can't get to that level with clever enough engineering.
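As a rough sketch of what that looks like in practice, you can track p99 per kind of operation rather than as one pooled number, each against its own budget. Everything here is hypothetical: the operation kinds, the sample timings, and the budgets other than the ~100ms read target mentioned above.

```python
import math
from collections import defaultdict

def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of timings."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(99 * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical p99 budgets in milliseconds, one per operation kind.
# Only the ~100ms read target comes from the discussion; the rest are made up.
BUDGET_MS = {"read": 100, "write": 300, "dynamic_query": 2000}

def check_budgets(events):
    """Group (kind, milliseconds) events by kind and compare each
    kind's p99 against its own budget."""
    by_kind = defaultdict(list)
    for kind, ms in events:
        by_kind[kind].append(ms)
    report = {}
    for kind, timings in by_kind.items():
        observed = p99(timings)
        report[kind] = (observed, observed <= BUDGET_MS[kind])
    return report

# Hypothetical traffic: reads dominate, writes and dynamic queries are rarer.
events = (
    [("read", 40)] * 98 + [("read", 180)] * 2
    + [("write", 220)] * 20
    + [("dynamic_query", 1500)] * 5
)

for kind, (observed, ok) in check_budgets(events).items():
    print(f"{kind}: p99={observed}ms, within budget: {ok}")
```

Splitting it this way keeps the slower-by-nature operations from masking (or being blamed for) a read path that has drifted off its target.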

---

Of course I imagine Google tends to be looking more at their p99.9/p99.99/pmax/etc(!), at least for their absolute highest volume systems.

None of us are going to be getting to that point, but it's often worth thinking about engineering principles against a super high bar - it often helps people to open their minds a bit more and think more outside the box when given a really dramatic goal orders of magnitude beyond their existing mindset.

Of course you're not expecting to really get to that level, but anchoring that way can achieve amazing things. I've done that with a lot of success at my company, and we actually did manage to achieve a few goals originally thought to be totally unrealistic.