Hacker News
by kronin 2990 days ago
It seems that metrics giving visibility into the "network connectivity was flaky" symptom would have isolated the problem to a single node pretty quickly: look at response times (particularly the 95th/99th percentiles), then dig into the affected pod, which gives you the node. Once a problem is isolated to a node, the first thing to look at is the node logs. Would that pattern not have worked in this case?
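To make the pattern concrete, here is a rough sketch of what I mean, not anything taken from the original write-up. It assumes you already have per-request latency samples tagged with a pod name and a pod-to-node mapping; the function names, the nearest-rank percentile, and the "3x the median p99" threshold are all my own illustrative assumptions.

```python
import math
from collections import defaultdict
from statistics import median


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def p99_by_node(samples, pod_to_node):
    """Group (pod, latency_ms) samples by node and return each node's p99.

    `samples` and `pod_to_node` are hypothetical inputs; in practice they
    would come from your metrics system and the cluster's pod listing.
    """
    by_node = defaultdict(list)
    for pod, latency_ms in samples:
        by_node[pod_to_node[pod]].append(latency_ms)
    return {node: percentile(vals, 99) for node, vals in by_node.items()}


def suspect_nodes(p99s, factor=3.0):
    """Return nodes whose p99 exceeds `factor` times the median node p99.

    The factor is an arbitrary illustrative threshold, not a recommendation.
    """
    baseline = median(p99s.values())
    return sorted(node for node, p99 in p99s.items() if p99 > factor * baseline)
```

If one node stands out, that is the one whose logs I would read first. The pod-to-node mapping is the sort of thing `kubectl get pods -o wide` shows in its NODE column.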