by theojulienne
2400 days ago
This is a great question, thank you for asking!

Initially, a few teams around the org had folks investigating poor performance from the different perspectives of the applications that were observing issues. Once it was clear that neither the applications themselves nor their configuration were at fault, the team that runs our Kubernetes infrastructure started collating information (in GitHub issues) until we had a clear repro (the Vegeta test) and knew what to look out for. This was the slowest part of the process, because we needed to understand that something non-application-level was going on, and because “random network latency” is a very difficult thing to narrow down. It probably took on the order of months from the first sign of an issue to fixing all the other issues that were contributing small amounts of latency and being sure we still had an underlying problem to find.

At that point it became clear that something more low-level was going on, so we put together a focus team drawn from several teams to investigate the underlying cause. That was a group of about 5 engineers actively working on it, with another 5-10 interested engineers following along and helping out. Folks typically worked in pairs or solo to dive into different potential leads, looping in everyone else in Slack as they went. Most of the work here was finding signal in the noise: we found a lot of other smaller system-level issues along the way that were ruled out and/or were low priority to fix. There were other DNS-related issues at play, and fixing those also improved things, but they were not the specific underlying issue in the post.

Going down the specific path in the post took just a few days once the first few steps showed something was wrong at the packet level. The remediation from there was also just a few days, because we already had infrastructure in place to detect a known issue and mitigate it in a safe, graceful way.

The focus team worked on this as a primary task for a few weeks overall.