| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by confluence_perf 2036 days ago

Hi BillinghamJ,

You're right, I apologize for not being clear. We're targeting 1s for "Initial loads" on new tabs/new navigation, which I assume you're referring to. Our target for 'transitions' is different.

If however the numbers you're referring to are "initial load" numbers, then I'm not sure.

(edit: and action responses again are also a separate category. Our largest number of complaints are about 'page load times' in Confluence, so most conversations center around that)

3 comments

petters 2036 days ago

Initial loads should definitely be be <100ms as well.

But Jira currently is so slow that 1s would be a great improvement. I am using it at work and regret it, unfortunately.

BillinghamJ 2036 days ago

As a first step, 1s would be better than nothing for sure, but you need to be working towards a much tighter goal on a 1-2 year timeframe.

New load, you should really be hitting 200ms as your 95th percentile - 300ms or so would be decent still. "Transitions" should hit 100ms 95th, 150ms would be decent.

If you did hit 100ms across the board, you'd be rewarded by your customers/users psychologically considering the interactions as being effectively instantaneous. So it really is worth setting a super high bar as your target here (esp given you need a bit of breathing room for future changes too).

confluence_perf 2036 days ago

Thank you for coming back and clarifying. Do you happen to have links to any public testing results of other tools, or guidance to this specificity - would love to use them to build a case internally

Most of what we've seen online are nowhere near this level of detail (X-ms for Y-%ile for Z-type of load)

(edit: clarified request)

BillinghamJ 2036 days ago

I'm afraid I'm no expert on project management tools!

On what users experience as effectively "instantaneous", that's from experience on UX engineering and industry standards - https://www.nngroup.com/articles/response-times-3-important-...

On the other noted times, they're just a general range of what can be expected from a reasonably well-built tool of this nature. Obviously much simpler systems should be drastically faster, but project management tools do tend to be processing quite a bit of data and so do involve _some_ amount of inherent "weight", but that isn't an excuse for very poor perf.

That said, I imagine if your PMs do some research and go ahead and try using some of the common project management tools, you should get a good idea. ;) Keep in mind speeds to Australia (assuming Atlassian is operated mostly there?) will likely show them in a much worse light than typical perf experienced in the US/UK/EU areas.

The time to first load is derived from the fact that you're running essentially the equivalent of many "transition" type interactions, but they should be run almost entirely in parallel, so roughly 2x between "transition" and "new load" is a reasonable allowance.

confluence_perf 2036 days ago

Thanks for the link! Yes this is the general guidance we're using too (0.1/1/10s), and one that we're reinforcing at every level of the company. This link does have more detail than I've seen in other places though, so it's an interesting read.

However I've not seen guidance on whether these should be P90 or P95 or P99 measures for example though. We've selected something internally, but obviously selecting amongst three 'measurement points' could drastically change general user's experience.

(HN is throttling my replies so apologies for delay)

BillinghamJ 2036 days ago

The percentiles are a bit of a combination.

A big part is simply how far you are in your journey of getting good at performance - if your p50 is still garbage, there's not much point in focussing on your p99 measurements. You should be targeting the p99 long term, but focus on the p50/p90 for now.

It's super important to target and make long term decisions around the p99 though, because, e.g., making a 100x improvement is not possible through little iterative changes over 2-3 years. You need a base to work from where that 100x is fundamentally achievable, which requires thinking from first principles and slightly getting out of the typical product mindset.

I also find the typical product mindset tends to result in focussing a lot on the "this quarter/next quarter" goals, but neglecting the "8/12 quarters from now" as a result.

Beyond short term/long term goals, the choice is largely just down to what the product is/does. Even ignoring all current architectural choices, there are some fundamentals where certain things must always be faster/slower - e.g. sync writes will typically be a fair bit slower than reads, and typically occur much less often, complex dynamic queries which can't be pre-optimised require DB scanning but are much less common.

For these kinds of tools, where most of the interaction is reads, mostly on predefined or predefined + small extra filtering, and reading/writing on individual resources (ie tickets), you can get p99 numbers trending towards the 100ms mark eventually - there's very little which truly can't get to that level with clever enough engineering.

---

Of course I imagine Google tends to be looking more at their p99.9/p99.99/pmax/etc(!), at least for their absolute highest volume systems.

None of us are going to be getting to that point, but it's often worth thinking about engineering principles against a super high bar - it often helps people to open their minds a bit more and think more outside the box when given a really dramatic goal magnitudes beyond their existing mindset.

Of course you're not expecting to really get to that level, but anchoring that way can achieve amazing things. I've done that with a lot of success at my company and we actually did manage to achieve a few originally thought to be totally unrealistic.

texasbigdata 2036 days ago

Not to be a jerk, but you guys don’t allow others to take your performance metrics, but you’re publicly soliciting performance data from other products at the same time? I’m assuming you’re taking it for granted they don’t have a ToS that bans you from doing this.

Sorry if that’s pointed, but it’s sort of meant to be incredulous (but hopefully not offensive).

confluence_perf 2036 days ago

Not offended - as an employee I have no specific insight into the actual headline term in the ToS - honestly I'm planning on tracking down someone in legal to help clarify this, since it seems like it currently as written (and currently as interpreted in worst case) unecessarily impedes me from doing my job.

I would never encourage anyone to violate a ToS of another product and apologize for anyone that was considering doing it due to my ask.

I think these are other possibilities: 1) (as stated) other products don't have such ToS 2) other products may have published their own metrics and made them available for consumption 3) from a more legal in depth standpoint, maybe other companies have such ToS terms but have clarified them to some point that makes them more clear about when they apply and when they don't

texasbigdata 2036 days ago

Sorry you have to work on this thread on vacation dude (or girl). This thread has been an absolute beat down and you’ve treated it with utmost professionalism when it’s pretty clear you’re new to the team. It is Saturday night after all.

confluence_perf 2036 days ago

I appreciate the well wishes, honestly it means a lot - good guess too, I am in the US (maybe you knew, but most of Confluence Cloud is based out of the West Coast offices).

I don't know what the overlap is between Atlassian users and IT admins though - my previous job was on the vSphere UI and if you happen to know about the death of the Flash based client, this is not too far off.

Hopefully users stay willing to engage with us so we can improve the product as fast as possible.

deathweasel 2036 days ago

Do you guys use synthetic monitoring tools?

confluence_perf 2036 days ago

I don't think I know what that might exactly mean - I know we have synthetic traffic generation tools (and thus, measurements generated from the synthetic traffic), but I think those exhibit the same variance as production -> the backend for them are the same cloud IaaS systems and SW, so there's no 'sandboxed from all outside variance'.

If it means something else then I'm not aware if we do it or not.

launderthis 2036 days ago

no dont give into this guy ... this is done over the net. The rate of transfer has to be taken into account. Unacceptable is a measure of comparison.

Unacceptable to who, you have a faster provider for cheaper, with as many features???

Im pretty sure he doesnt because if he could he would go there. There are tradeoffs and Atlassian has many project they are working on. They understand that there is room for improvement in performance. Its one of Atlassian's priorities, it is a tech company (a pretty good one I would say).

I guess one question is about server redundancy. Where is this guy loading from and where is the server he is loading from? Getting things below 1s is nearing the speed of the connection itself. Also at that speed there is deminishing returns. Something that happens at 1s vs .5s doesnt make you twice as fast when you dont even have the response time to move your mouse and click on the next item in .5s.

Sometimes techies just love to argue. You are doing great Atlassian and have tons of features. But maybe it is time to revisit and refactor some of your older tools.

BillinghamJ 2036 days ago

You've shown poor understanding here.

> Getting things below 1s is nearing the speed of the connection itself

That is absolutely false. Internet latency is actually very low - even e.g. Paris to NZ is only about 270ms RTT, and you _do not_ need multiple full round trips to the application server for an encrypted connection - on the modern internet, connections are held open, and initial TLS termination is done at local PoPs.

For services like this - as they are sharded with customer tenancy - are usually located at least in the same vague area as the customer (e.g. within North America, Western Europe, APAC etc).

For most users of things like Atlassian products, that typically results in a base networking latency of <30ms, often even <10ms in good conditions.

Really well engineered products can even operate in multiple regions at once - offering that sort of latency globally.

> Im pretty sure he doesnt because if he could he would go there

Yeah, we don't use any Atlassian products - partly for this reason. We use many Atlassian-comparable tools which have the featureset we want and which are drastically faster.

> when you dont even have the response time to move your mouse and click on the next item in .5s.

There is clear documented understanding of how UX is affected by with various levels of latency - https://www.nngroup.com/articles/response-times-3-important-...

> Sometimes techies just love to argue

Not really, I have no particular investment in this - I don't use any Atlassian product, nor do I plan to even if they make massive perf improvements.

But I do have an objective grasp - for tools like this - of what's possible, what good looks like, and what user expectations look like.

> no dont give into this guy

I don't expect Atlassian is going to make any major decisions entirely based on my feedback here, but it is useful data/input for exploration, and I do feel it's right to point out that they're looking in the wrong ballpark when it comes to the scale of improvement needed.

jiggawatts 2036 days ago

To put things in perspective, the typical Jira 5-second page load time as reported by many people in this forum is equivalent to twice the round-trip time for light to the Moon!

It's the network latency equivalent of a million kilometres of fibre!

bloaf 2036 days ago

The internet is fast. Computers are fast. One second is enough time for my machine to download 10M data points and render them into an interactive plot.

https://leeoniya.github.io/uPlot/bench/uPlot-10M.html

In my mind, anyone doing UI development and seeing user interactions taking over 1 second should be asking themselves "did the user just try to operate on more than 10^6 of something?" and if the answer is no, start operating under the assumption that they've made a mistake.