by zzzcpan
2887 days ago
So, how do you choose that service level objective? How do you know which solutions to implement so you don't make things "overly reliable"? Isn't that the more important question? Doing this without some sort of methodology will almost always result in useless solutions and overpaying cloud and other hosting providers. For example, implementing rather expensive failover within the datacenter while ignoring how unreliable datacenters are and how cheaply you can implement failover between datacenters via DNS. I like the idea of modelling availability/reliability for this. Even if you don't have the right numbers and do it on a napkin, not in code, it can still highlight the solutions with the best cost/benefit ratios.
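For what it's worth, the napkin model fits in a few lines of Python. This is only a rough sketch: every number below is a made-up assumption rather than a measurement, the helper names are invented for illustration, and failures are assumed to be independent, which real outages often are not.

```python
from math import prod


def series(*parts):
    """Every part must be up, so availabilities multiply."""
    return prod(parts)


def parallel(*replicas):
    """At least one replica must be up (assumes independent failures)."""
    return 1 - prod(1 - a for a in replicas)


def downtime_hours_per_year(availability):
    """Expected downtime in hours over a 365-day year."""
    return (1 - availability) * 365 * 24


# Hypothetical availabilities, for illustration only.
SERVER = 0.999        # one server
DATACENTER = 0.995    # the datacenter as a whole (power, network, ...)
DNS_FAILOVER = 0.9999 # the DNS-based switch between datacenters

# One server in one datacenter.
single = series(DATACENTER, SERVER)

# Two servers with failover inside the same datacenter:
# the datacenter is still a single point of failure.
in_dc_failover = series(DATACENTER, parallel(SERVER, SERVER))

# One server in each of two datacenters, switched via DNS.
cross_dc_failover = series(DNS_FAILOVER, parallel(single, single))

if __name__ == "__main__":
    for name, a in [
        ("single server", single),
        ("failover within datacenter", in_dc_failover),
        ("failover between datacenters", cross_dc_failover),
    ]:
        print(f"{name}: {a:.5f} ({downtime_hours_per_year(a):.1f} h/year down)")
```

With these assumed inputs, failover inside the datacenter can never beat the datacenter's own availability, while the cross-datacenter setup is limited mainly by the DNS switch. Put a rough cost next to each line and you get the cost/benefit comparison.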
There's an excellent talk by Google VP of SRE Ben Treynor: https://www.youtube.com/watch?v=iF9NoqYBb4U. tl;dw: try to measure actual user experience, and make sure that even the long tail of customers still gets a good product experience. What "good product experience" means depends on your product.
Once that objective is met, the rest of the error budget is yours to spend on releasing new features, changing the underlying architecture, etc.
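To make the error budget arithmetic concrete, here is a small Python sketch. The function names, the 30-day window, and the 99.9% target are assumptions picked for illustration, not figures from the talk.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full downtime the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo, bad_requests, total_requests):
    """Fraction of a request-based error budget that is still unspent.

    Goes negative once more requests have failed than the SLO allows.
    """
    allowed = (1 - slo) * total_requests
    if allowed <= 0:
        raise ValueError("a 100% SLO (or zero traffic) leaves no error budget")
    return 1 - bad_requests / allowed


if __name__ == "__main__":
    # A hypothetical 99.9% SLO over 30 days.
    print(f"{error_budget_minutes(0.999):.1f} minutes of downtime allowed")
    # 500 failed requests out of 1,000,000 against that same SLO.
    print(f"{budget_remaining(0.999, 500, 1_000_000):.0%} of the budget left")
```

While the remaining fraction is comfortably positive, there is room for risky releases; once it gets close to zero, that is the signal to slow down and spend the time on reliability instead.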