by andsens 844 days ago
I never understood the appeal of service meshes. Half of their reason to exist is covered by vanilla Kubernetes, the rest is inter-node VPN (e.g. WireGuard) and tracing (Cilium Hubble). Unless I’m missing something, encrypting intra-node traffic is pretty silly.

K8S has service routing rules, network policies, access policies, and can be extended up the wazoo with whatever CNI you choose.

It’s similar to Helm, in that Helm puts a DSL (values.yaml) on top of a DSL (Go templates) on top of a DSL (k8s YAML), just that here it is routing, authentication, and encryption on top of... well, routing (service route keys), authentication (netpols), and encryption.

It boggles the mind!

8 comments

I've worked on several k8s clusters professionally but only a few that used a service mesh, Istio mainly. I'll give you the promise first, then the reality.

The promise is that all of the inter-app communication channels are fully instrumented for you. Four things, mainly: 1) mTLS between the pods, 2) network resilience machinery (rate limiting, timeouts, retries, circuit breakers), 3) fine-grained traffic routing/splitting/shifting, and 4) telemetry with a huge ecosystem of integrated visualization apps.

Arguably, in any reasonably large application, you're going to need all of these eventually. The core idea behind the service mesh is that you don't need to implement any of this yourself. And you certainly don't want to duplicate all of this in each of your dozens of microservices! The service mesh can do all the non-differentiated work. Your services can focus on their core competency. Nice story, right?
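For a sense of what "implementing this yourself" would mean, here is a minimal sketch of just one item from that list, a circuit breaker. It is an illustration only, not how Istio or Envoy do it; the class name, thresholds, and the injectable `clock` parameter are all made up for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so the logic can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single failure from here reopens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success resets the failure count.
        self.failures = 0
        return result
```

Multiply that by timeouts, retries, rate limiting, and every language your services are written in, and the pitch for pushing it into a shared proxy becomes clearer.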

In reality, it's a little different. Istio is a resource hog (I've evaluated Linkerd, which is slightly less heavyweight, but still). Rule of thumb: for every node with 8 CPUs, expect your service mesh to consume at least a CPU. If you're using smaller nodes on smaller clusters, the overhead is absurd. After setting up your k8s cluster + service mesh, you might not have room for your app.

Second, as you mention, k8s has evolved. And much of this can be done, or even done better, in k8s directly. Or by using a thinner proxy layer to only do a handful of service-mesh-like tasks.

Third, do you really need all that? Like I said, eventually you probably do if you get huge. But a service mesh seems like buying a gigantic family bus just in case you happen to have a few dozen kids.

One major usage of service meshes that I’ve come across is for the transparent L7 ALB. gRPC, which is now very common, uses long-running connections to multiplex messages. This breaks load-balancing because new gRPC calls, within a single connection, will not be distributed automatically to new endpoints. There is the option of DNS polling, but DNS caching can interfere. So, the L7 service mesh proxy is used to load balance the gRPC calls without modification of services.

https://learn.microsoft.com/en-us/aspnet/core/grpc/loadbalan...
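To make the gRPC point concrete, here is a toy simulation (no real gRPC or networking involved). It compares pinning each long-lived connection to one backend with picking a backend per call. The function names and the pod names are invented for the example, and plain round-robin stands in for whatever policy a real balancer would use.

```python
from collections import Counter
from itertools import cycle


def connection_level(backends, clients, calls_per_client):
    """L4-style: each client connection is pinned to one backend for its lifetime."""
    hits = Counter({b: 0 for b in backends})
    picker = cycle(backends)
    for _ in range(clients):
        backend = next(picker)            # chosen once, when the connection opens
        hits[backend] += calls_per_client  # every call rides the same connection
    return dict(hits)


def request_level(backends, clients, calls_per_client):
    """L7-style: a proxy picks a backend for every individual call."""
    hits = Counter({b: 0 for b in backends})
    picker = cycle(backends)
    for _ in range(clients * calls_per_client):
        hits[next(picker)] += 1
    return dict(hits)


if __name__ == "__main__":
    pods = ["pod-a", "pod-b", "pod-c"]
    print(connection_level(pods, clients=2, calls_per_client=300))
    print(request_level(pods, clients=2, calls_per_client=300))
```

With two clients and three pods, the connection-level run leaves `pod-c` with zero calls while the other two take 300 each; the request-level run gives each pod 200.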

Look, back in the day, things weren't encrypted, so you could listen in on your neighbor's phone calls, read their email, hack their bank accounts. Wireshark and etherdump and the most fun of all, driftnet. So, since then, everything has to be encrypted, lest someone hack their way to the family jewels. Never mind that the number of breaks to get there means there are usually bigger fish to fry. The important thing is to sprinkle magic encryption dust on everything because then we know it's Very Secure. (That's not to deride the fact that encryption is important, because it is, but sometimes it goes a bit far when there are other gaping holes that should be patched first.)
Usually, unless someone is really doing naive things, you will need to have access to a lot of almost physical things to sniff traffic. You almost need to physically have access to the room where either the server or the client is, even with unencrypted traffic. People say, 'but they can sniff it at level3'; they sure can, IF they have actual access to level3 on a higher level than just using them for normal traffic. Hacked switch or router or so. Probably state actors can and do pull that off, but outside that, it's really not so easy to get to unencrypted traffic of just a random target. You still should encrypt things of course when you can, but you don't have to get quite that paranoid about it.

All major hacks are 0-days (well, not updated Wordpress is not necessarily 0-day; a lot of 0-days are exploited months or years later), stolen credentials (social engineering usually), brute force password hacks or applications that are left open (root/root for mysql with 3306 open to the world). Those have nothing to do with (un)encrypted traffic.

If you have the ability to execute code on a CPU, and that CPU is connected to a bus, and that bus is connected to a network card, you can sniff traffic. If you have data and business processes that include at least one entity A that lacks absolute trust in at least one other entity B in your cluster, then A's traffic being visible to B is bad.
Yes, but if you know that I run unencrypted traffic on my network and if I tell you that, you still won't be able to get to any of that if you cannot get into our network. Even if I tell you that I host at provider X and the traffic is unencrypted until it hits our webserver, you still won't be able to sniff any of it without getting very intimate with someone who has deeper access. Just hiring a machine at the same provider and putting the card in promiscuous mode is not going to get you anything from us.
It's not just a specific actor targeting a specific entity though; it's any malicious dependency being run in a privileged environment.
Yes, that's true. But then you might have bigger issues I would say. But agreed. It's a good reason to make sure it's all closed off.
Service Meshes are something necessary for a small portion of Fortune 500s which have 1000s of microservices. Sure you could use load balancers but it becomes cost efficient to move towards a client-side load balancer.

If you aren't a Google, Apple, Microsoft, etc. scale company, then a service mesh might be a tad overkill.

You're close, but it's really when you have thousands of microservices using either shitty languages or shitty client RPC libraries where you can't easily perform client-side load balancing.

There are plenty of languages and RPC frameworks where you can solve this without resorting to a service mesh.
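As a rough illustration of what client-side load balancing amounts to when the library supports it: the caller keeps the endpoint list and picks a target per request. This is a minimal sketch, not any particular RPC framework's API; `RoundRobinPicker`, its methods, and the addresses are hypothetical.

```python
import threading


class RoundRobinPicker:
    """Client-side balancing: the caller holds the endpoint list and picks per request."""

    def __init__(self, endpoints):
        self._lock = threading.Lock()
        self.update(endpoints)

    def update(self, endpoints):
        # Would be called whenever service discovery reports a new endpoint set.
        if not endpoints:
            raise ValueError("need at least one endpoint")
        with self._lock:
            self._endpoints = list(endpoints)
            self._index = 0

    def pick(self):
        # Return the next endpoint in rotation; safe to call from many threads.
        with self._lock:
            endpoint = self._endpoints[self._index % len(self._endpoints)]
            self._index += 1
            return endpoint
```

The hard part in practice is less this loop than getting every team, language, and client library in the organization to do it consistently.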

Practically, and to your point, service meshes solve an organizational problem, not a technical one.

I don't get this either. Doesn't the mesh become a scalability bottleneck just like load balancers?

On that scale I'd expect people to use client-selected replicated services (like SMTP), and never something that centralizes connections (no matter where it's close to).

You can always add observability at the endpoints. Unless your infrastructure is very unusual (like some random part of it costing millions of times more for no good reason, as on the cloud), this is not a big challenge; you add it to the frameworks your people use. (I mean, people don't go and design an entire service with whatever random components they pick, or do they?)

With Istio (envoy) you run a "sidecar" container in your pods which handles the "mesh" traffic, so it scales with the number of instances of your pods.
Oh, thanks. That does solve the issue.
So like a DNS SRV record with multiple entries. Or Anycast, if you're being fancy
Or IPVS... wait that's built into Kubernetes, it's kube-proxy.
kube-proxy operates at L3/L4 while service meshes generally operate at L7 so it can load balance on a per HTTP request basis. Particularly useful for long lived connections commonly used in gRPC and others.
Right? Like, I really don't understand the problem that service mesh solves that isn't already solved by more standardized technologies
Isn't kube-proxy already a client-side load-balancer?
I agree that intra node encryption, if implemented by sidecars, is just wasting CPU cycles.

Small note, unless it has changed recently, containerd default capabilities list includes CAP_NET_RAW, so hostNetwork=true pods can sniff all traffic.

I like that istio does mtls. It also helps with monitoring the requests.
I actually never understood the appeal of Kubernetes in the first place. I have production apps running on bare bones VMs serving millions of customers. Is this sort of complexity really necessary? At this point I would just consider serverless options. Sure, they would be a little more expensive, but that's a huge savings if we account for engineering teams' time.
Counter-take: I never understood the appeal of virtualizing the hardware. Is that complexity really necessary?

Of course there are tradeoffs, but I think it's a specific perspective that says that Kubernetes is any more complex than virtualizing the hardware and scheduling multiple VMs across real hardware.

I personally am of the opinion that you only really need one of these. Containers don't really need VMs and vice versa to get all of the benefits of abstracting away the fact that your hardware might break. It's just a choice of which abstraction you prefer to operate at (of course, clouds will put your containers in VMs anyway because they need that abstraction). It sort of sounds like the parent prefers the VM abstraction, and you prefer the container abstraction.
100% agree that containerising on VMs is ridiculous.
You didn't respond with any benefits of k8s though. What's the true value-add? Surely, coding an entire infrastructure in YAML is not it because that's horrific for anyone who wrote actually working software (one that didn't need 20+ commits of "try again" for a single feature to start working anyway).
I find it hard to believe that you've considered this for more than two seconds and can't think of a single reason why k8s might be a good fit for someone's requirements. But here's one:

It's a curated, extensible API that provides a decent abstraction over a heterogeneous collection of hardware. Nobody's done that before, and it's extraordinarily useful for being able to define intent.

No-one forces you to use YAML, it's just a serialisation format.

Once you have a bunch of components that implement this API, it becomes trivial to deploy pretty much any level of complexity of containerised application, without having to care too much about the actual exact details of how scheduling, networking, storage etc. is implemented. Even better, I can hit two different clusters configured in two completely different ways with the same manifest and get roughly the same result.

It's the abstraction.

> I find it hard to believe that you've considered this for more than two seconds and can't think of a single reason why k8s might be a good fit for someone's requirements.

That could also tell you that you're in a bubble of converts and have forgotten what it's like to live without k8s. ;) There are always at least two sides of the coin.

> No-one forces you to use YAML, it's just a serialisation format.

Really? So when do I get to describe my singular app that needs Postgres and Kafka in 7-10 lines in a .txt file and not {checks our company's ArgoCD backend repo} ... not 8 YAML files? I ain't got all day or a week, where is that? Nice declarative MINIMAL syntax with all the BS inferred for me (like names and stuff, why should I think of 5-10 names of stuff? Figure it out, automated tool!) that only concentrates on what is being deployed. It can generate everything else just fine, or at least it should, in a more sane world anyway.

Excuse the slightly combative tone, I ain't trolling you here but you also come across as a bit blind about things outside the k8s holy land.

> Once you have a bunch of components that implement this API, it becomes trivial to deploy pretty much any level of complexity

Nope, nothing ever becomes trivial with k8s. I was on call with two senior platform engineers and we needed 3-4 hours to describe a single small app that needs Postgres and Kafka and to listen on a single port while needing a single env var and a few super small config files. These guys provisioned an entire network of 500+ pods working perfectly for years. They made 10+ mistakes while trying to help me deploy this small app on Argo. (And they've done the same dozens of times at this point.)

Do with that info what you will but I'd strongly disagree if your takeaway is "they are not that good" -- because they have done quite a lot very successfully (with and on k8s).

> It's the abstraction.

My 22+ years of programming have taught me that people enjoy abstractions too much and make huge messes with them. I am not convinced that having an "abstraction" at all is even a good selling point anymore.

shrug

You originally said "I don't see how it's useful", and I observed that some people find it useful.

I'm sorry you've had a bad experience, but it's a bit short-sighted to extrapolate from that to "this is universally useless".

> you also come across as a bit blind about things outside the k8s holy land.

You're not the only one with multiple decades of experience in writing code and managing infrastructure. I've used a lot of the tools in the toolbox, I know when each is likely to be more or less appropriate.

> Really? So when do I get to describe my singular app that needs Postgres and Kafka in 7-10 lines in a .txt file

Could you post those 7-10 lines needed to fully manage a Postgres AND Kafka deployment? I'm by no means a master, but I have a decent amount of experience outside the "k8s holy land" and I have no idea how to accomplish that.

If you didn't have helm, you would be writing your own regex scripts. I don't see how this would be better.
If you didn’t have helm, you’d be using one of the other, much better tools, and be happier for it.
I generally meant, if you didn't have something like helm you would use regex. In any case, please elucidate, what tool do you like best?
Not the poster you're replying to, but when it comes to deploying your own applications: generating Kubernetes manifests with whatever language you're already using and feeding JSON to `kubectl apply -f -` can accomplish the same outcome with less effort.

Helm is still useful for consuming 3rd party charts, but IMO its status as the "default" is due more to inertia than good design.
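A minimal sketch of that approach in Python, assuming a single-container Deployment. The `deployment` helper, the app name, and the image reference are made up for illustration; the manifest fields are the standard `apps/v1` Deployment ones.

```python
import json
import sys


def deployment(name, image, replicas=1, port=8080):
    """Build an apps/v1 Deployment manifest as a plain dict."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    # "example.com/web:1.0" is a placeholder image, not a real one.
    json.dump(deployment("web", "example.com/web:1.0", replicas=2), sys.stdout, indent=2)
```

Saved as, say, `manifests.py`, it would be used as `python manifests.py | kubectl apply -f -`. Loops, functions, and tests come for free from the host language instead of from a template DSL.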

In my own experience I started off doing this because I wasn't ready to learn helm. However, after using helm once I didn't see the reason to do it in my own code any more.
I get the sense from the responses (and downvoting) that I will eventually learn to dislike helm!
Which ones would you recommend?
Any, it doesn’t matter which as long as you don’t have to count spaces in yaml by hand.

If you really want a concrete recommendation try https://cdk8s.io/.

Spinnaker
CUE
Really? What's wrong with https://github.com/mikefarah/yq? Works quite fine.