by rfreiberger
1599 days ago
The name needs to change, but so does the attitude that, as engineers, we build complex systems and assume everyone knows how to use them. A few worldwide outages I've been part of were caused by a task runner that didn't lint the command and allowed a broken bash one-liner to be executed across every system in parallel. Yes, it's a simple mistake, but how was a system given access to our global environment when this edge case had never been considered? In many of the postmortem meetings, the common issue is communication, even between co-workers on the same team, and between internal platform providers. One case was an outage on the storage backend, where we realized after a long meeting that the internal SLA was much greater than we expected (and greater than the point at which our systems would time out). It had only worked for as long as storage utilization was extremely low.
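For what it's worth, the lint step doesn't have to be elaborate. Here is a minimal sketch of a gate in front of the fan-out, not what our task runner actually did. The names `fan_out`, `bash_syntax_ok`, `run` and `check` are made up for illustration. It assumes `bash` is on the PATH and relies on `bash -n`, which parses a command without executing it, so it only catches syntax errors, not a well-formed command that does the wrong thing.

```python
import subprocess
from typing import Callable, Iterable, List


def bash_syntax_ok(command: str) -> bool:
    """Return True if bash can parse the command. -n means parse only, never execute."""
    result = subprocess.run(
        ["bash", "-n", "-c", command],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def fan_out(
    command: str,
    hosts: Iterable[str],
    run: Callable[[str, str], str],
    check: Callable[[str], bool] = bash_syntax_ok,
) -> List[str]:
    """Run `command` on every host, but only after it passes the syntax check.

    `run(host, command)` is a hypothetical stand-in for whatever actually
    executes the command remotely (ssh, an agent, and so on).
    """
    if not check(command):
        # Fail before touching a single host.
        raise ValueError(f"refusing to run unparseable command: {command!r}")
    return [run(host, command) for host in hosts]
```

The point of the design is only the ordering: the check happens once, up front, and a failure means zero hosts are touched.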
|
Programming culture has an almost football-field level of "no time for weakness" attitude.
If I see a possible failure mode in a system and bring it up, someone's going to tell me to stop being a clicky-click Windows idiot and learn to be careful.
Trying to prevent human error in software isn't seen as a priority, so nobody does it. People are concerned with the most reliable code rather than the most reliable code-user-hardware-task-schedule-conditions system.
Programmers need to accept software fixes for human and hardware failures. It's a lot easier to add a confirmation dialog than it is to somehow become 100% reliable at not clicking the wrong thing.
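To show how small that kind of fix can be, here is a rough sketch of a confirmation step for a command-line tool. The function name `confirm_destructive` and its parameters are invented for this example. It makes the operator retype the target name rather than just press "y", on the assumption that retyping is harder to do by reflex.

```python
from typing import Callable


def confirm_destructive(
    action: str,
    target: str,
    read: Callable[[str], str] = input,
) -> bool:
    """Ask the operator to retype the target name before a destructive action.

    `read` defaults to the built-in input(); it is a parameter only so the
    prompt can be exercised without a terminal.
    """
    typed = read(f"About to {action} '{target}'. Type the name to confirm: ")
    # Surrounding whitespace is ignored; anything else must match exactly.
    return typed.strip() == target
```

A caller would wrap the dangerous operation in `if confirm_destructive("delete", name): ...` and do nothing otherwise.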