Just today at $DAYJOB we had a complaint that one of a team's Azure Kubernetes clusters was "slow". About the only metric out of the norm was network traffic - but the lack of detailed instrumentation meant we couldn't really isolate the cause to a specific container or process.
If the only explanation someone can provide is that "a cluster is slow", the issue isn't with network observability. They need to do at least the minimum level of analysis before escalating.
Yes, that would be great, but unfortunately there are application teams (particularly in the enterprise) lacking such tact when blaming infrastructure for issues.
Good old silos are alive and well, and ownership is not always part of the culture.
In our case the expected golden path is that once our team figures out the proper procedure, we will establish it for the downstream teams that directly support the application teams.
So at least in theory things are somewhat well set up, but there's too much siloing at our level (wildly separate network teams, teams for specific clouds, etc.)
It’s like Cilium + Hubble, but useful if you don’t/can’t run Cilium. It uses eBPF to collect metrics and stats on what flows where, and can record an impressive amount of stuff without requiring any instrumentation of your applications. Amazingly handy when you run both first-party and 3rd-party apps in your K8s cluster. The network maps these tools produce are handy too.
Although, Cilium is pretty great, so not sure why you wouldn’t run it, given the option…
Cilium is a CNI - the functionality that provides the K8s cluster inter-pod networking. The fact that it uses eBPF to deliver its functionality is what gives it the impressive observability you usually only get from a service mesh. I agree that not everyone needs a service mesh.
Haven't used this, but I tried out Pixie while debugging where outgoing traffic was coming from and where it was going. I was fairly successful, although Pixie wasn't very stable and had a lot of issues causing crashes.
In this case, we had a couple of services talking to 3rd-party services running on AWS, so it wasn't obvious from generic flow logs.
I also used Lacework a couple of years ago, which is eBPF-based, and it was pretty trivial to see things phoning home, or one-off maintenance where a new connection was being initiated.
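For what it's worth, once you have per-flow records with a source pod attached, finding out who is sending traffic where is mostly a grouping exercise. A minimal sketch in Python, assuming a made-up `Flow` record shape (the `src_pod`, `dst_host` and `bytes_sent` fields, the pod names and the hostnames are all hypothetical and not the export format of Pixie, Lacework, Hubble or any other tool):

```python
from collections import defaultdict
from typing import NamedTuple


class Flow(NamedTuple):
    """Hypothetical flow record; real tools export richer, different schemas."""
    src_pod: str
    dst_host: str
    bytes_sent: int


def top_talkers(flows, limit=3):
    """Sum bytes per (source pod, destination) pair, largest first."""
    totals = defaultdict(int)
    for flow in flows:
        totals[(flow.src_pod, flow.dst_host)] += flow.bytes_sent
    # Sort pairs by total bytes, descending, and keep the top `limit`.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:limit]


# Made-up sample data purely for illustration.
flows = [
    Flow("checkout-7d9f", "api.example.com", 1_200),
    Flow("checkout-7d9f", "api.example.com", 800),
    Flow("reports-5c2a", "storage.example.net", 50_000),
    Flow("reports-5c2a", "api.example.com", 300),
]

for (pod, dest), total in top_talkers(flows):
    print(f"{pod} -> {dest}: {total} bytes")
```

The hard part in practice is getting the pod or process attribution in the first place, which is what the eBPF-based tools provide; the aggregation itself is the easy bit.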