by EricBurnett
1860 days ago
Googler here. These people are smart and usually experienced, but don't put them on a pedestal. A lot goes into it: whole teams building the infrastructure with reliability as a core feature, and many hours spent running down every possible single point of failure (so it usually takes a couple of issues combining to cause a big incident). Good monitoring platforms, release systems, etc. Training and practice at incident handling, with many more folk only a page away. https://sre.google/sre-book/table-of-contents/ is a good place to start.

And ultimately, perhaps the most critical distinction: opportunity. You only work on a huge distributed system when there's enough customer demand for it to need to exist, and every large system started as a smaller system. That scaling of demand scales importance, which scales the effort invested in its reliability/scalability/efficiency/etc. You can read great stories about the many failures at Google, or Twitter (remember the Fail Whale?), or any other large company. The maturity you see now was developed over time, and any newly hired engineers will be trained into the culture of maintaining and improving it further.

With few exceptions, the folk who incepted the big systems back when they were small aren't the ones scaling them out today anyway - it's a very teachable skill, given the need and opportunity.