|
|
|
|
|
by tonymet
40 days ago
|
|
MTBF = optimizing quality (reliability, uptime, correctness) of AI product MTTR = optimize the ability to correct failures when they occur. He's describing leaders who believe quality no longer matters because any faults or deviations can be corrected so quickly that it doesn't make any sense to waste time on quality. |
|
- What alerts are we missing that could have helped us catch that earlier?
- What dashboards could we have had to help diagnose the issue quicker?
- What Ops tools could we have had to help mitigate such issue quicker?
- What extra logging/metrics/telemetry could we add to help us catch this quicker?
- What “safe deployment practices” could we have employed to avoid/improve this?
- what processes could we enforce to facilitate all of that?
Rinse and repeat that few hundreds or thousands of times while mounting MTTR KPI and you will see that number improve. Most likely through your team “gaming it”
MTBF is much, much, tricker to measure or “manage out”. It’s about “excellence in engineering” which is not measurable nor controllable. You want a random feature X. Your team tells you it’s really not how the system works, and they want few months making the change slowly while observing the system. But you don’t want just X, you want X, Y, Z, W, V, Q, A, B, C, D, all the way throw AAZZW12. So you tell the team to go fuck itself.