| Failure is proportional to change. A growing company is frequently changing. A company that launches new features is changing. A company trying to fix its architecture is changing. The large workforces that a lot of Valley companies have are built around and justified by this growth and change. The change that Twitter will likely experience now is machine failure (probably 3 in 1,000 machines a day), hard drives wearing out, potentially database promotions, and failures of cache machines. Automation can reduce a lot of these to very small workloads, but capacity management is a potentially existential crisis looming over all tech companies. Then you get to the real problems that Twitter faces: political change, security change, and workforce rot. Political or regulatory change poses a problem because it often requires changes to infrastructure, which is exactly the type of change that can result in failure. Security change can come from supply chain problems or bug reports: maybe keys need to be rotated, new encryption added, or software updated. All of these are change, and all can result in failure, potentially catastrophic failure. Lastly, the largest existential problem is that the engineers left at Twitter are likely not its best, and many of them are probably coerced into staying due to H-1B regulation. Now you run into a problem of attrition and of replacing that attrition. When your good engineers leave (or are overworked), it's harder to hire good engineers. The difference between a good engineer and a bad engineer is their `complexity to result` ratio. Good engineers create simple solutions, while bad engineers create complex solutions, even though both might produce the same end result. Failure is also proportional to complexity, and outage duration is the thing most affected by complexity. |
No serious engineer likes complexity for the sake of complexity. That kind of preference may only apply to juniors practicing RDD (Resume-Driven Development).
There are times when a simple solution is not obvious even to seniors, but these are generally very rare cases.