by atombender 3119 days ago
Networking is simple from the container's point of view. It's less simple outside the container.

But outside the container, the strategy is still much simpler than other solutions (most of which predate Kubernetes). Kubernetes chooses to give every pod its own IP. This means choosing an internal network such as 10.x.x.x, and giving each machine a slice of it. This way, one single cluster shares the same big, flat space of IP addresses; not only do pods have the same IP inside the container, but they can talk to other pods using the other pod's IP, too.
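To make the "slice per machine" idea concrete, here is a minimal sketch using Python's standard `ipaddress` module. The 10.0.0.0/8 range, the /24 slice size, and the node names are illustrative assumptions, not what any particular cluster uses, and this is not how Kubernetes itself allocates the ranges.

```python
import ipaddress

# Assumed cluster-wide pod network and per-node slice size (illustrative values).
CLUSTER_CIDR = ipaddress.ip_network("10.0.0.0/8")
NODE_PREFIX = 24

def assign_node_subnets(node_names):
    """Give each node the next free /24 out of the cluster CIDR."""
    subnets = CLUSTER_CIDR.subnets(new_prefix=NODE_PREFIX)
    return {name: next(subnets) for name in node_names}

def node_for_pod_ip(assignments, pod_ip):
    """Find which node's slice contains a given pod IP (None if no node owns it)."""
    addr = ipaddress.ip_address(pod_ip)
    for name, subnet in assignments.items():
        if addr in subnet:
            return name
    return None

# Hypothetical node names, purely for illustration.
assignments = assign_node_subnets(["node-a", "node-b", "node-c"])
```

Because every pod IP falls in exactly one node's slice, getting a packet to a pod reduces to "which node owns this prefix", which is what makes the flat address space easy to reason about.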

But a key point is that Kubernetes is designed to take care of most of it. One part of it is the iptables proxy magic that it does to allow services to have dynamically assigned IPs, too, with simple load-balancing between them. The second part is the many built-in plugins for different, more complicated overlay strategies. Kubernetes' automatic configuration works out of the box on, say, AWS, without anything magical — Kubernetes natively talks to AWS to set up a routing table so that packets end up where they should. You don't need more complex overlay networking stacks such as Calico, Flannel or Weave right away.
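For the curious, the "simple load-balancing" can be modeled in a few lines. As far as I know, the iptables proxy mode emits one rule per backend using the `statistic` match in random mode, with probabilities 1/n, 1/(n-1), and so on down to 1 for the last rule. The sketch below is a Python model of that cascade, not the actual rules; the function names are made up for illustration.

```python
import random
from fractions import Fraction

def pick_backend(backends, rng=random):
    """Walk the rules in order; rule i matches with probability 1/(n - i).
    The last rule has probability 1, so some backend is always chosen."""
    n = len(backends)
    for i, backend in enumerate(backends):
        if rng.random() < 1.0 / (n - i):
            return backend
    return backends[-1]  # not reached: the final rule always matches

def effective_probabilities(n):
    """Exact overall chance of landing on each of n backends."""
    remaining = Fraction(1)  # probability that no earlier rule matched
    result = []
    for i in range(n):
        p = Fraction(1, n - i)
        result.append(remaining * p)
        remaining *= 1 - p
    return result
```

The per-rule probabilities look uneven, but once you multiply by the chance of reaching each rule, every backend ends up with the same 1/n share.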

As for ingress, it has absolutely been Kubernetes' weakest point for several years, and the Kubernetes team knows this perfectly well. That said, it's not complicated, thanks to the above. Once you have, say, Nginx listening on a port, routing traffic into the cluster is a matter of setting up a load balancer (at least on clouds like GCP, DigitalOcean and AWS), something which Kubernetes can even do automatically for you. The weak links are the ingress controllers (the Nginx one is popular because it's stable and supports common features such as TLS, whereas others such as Voyager and Traefik are lagging), as well as the impedance mismatch with cloud LBs such as the Google Load Balancer.

So far, Kubernetes' ingress support has been generic: One ingress object can be used to "drive" different HTTP servers. The problem, of course, is that HTTP implementations all have different settings (timeouts, TLS certs, CDN functionality) and concerns that the current, simple ingress format cannot support. I'm expecting this to change soon. Ingress portability really isn't an important concern, and the generic ingress format is a bottleneck for the ingress functionality to mature.


>This way, one single cluster shares the same big, flat space of IP addresses; not only do pods have the same IP inside the container, but they can talk to other pods using the other pod's IP, too.

Why is having a big, flat namespace important? Routers route. Clos L3 networks are no longer a fancy thing. They're commonplace now. I don't see any advantage of having a flat network.

> One part of it is the iptables proxy magic that it does to allow services to have dynamically assigned IPs, too, with simple load-balancing between them.

Ah yes, the iptables "magic". We call this slowness and obfuscation. People who understand how to run networks don't like handwavy magic. We like simple, elegant concepts. Kubernetes networking is very far from simple and elegant. It's black-box "magic".

>You don't need more complex overlay networking stacks such as Calico, Flannel or Weave right away.

I run a native L3 network, so I have no need for an overlay network on top of it. That said, I'd argue that the overlay junk is probably easier for non-networking-fluent developers to set up and run compared to routing in AWS.

Kubernetes networking can be summed up thusly: Great for developers who know nothing about networking but want to run at hyperscale. Terrible for people who actually know how to run networks properly.

Kubernetes ingress is garbage. Stop apologizing for it.

IPv6 would also get rid of 99% of these overly complicated hand-wavy solutions that Kubernetes proponents constantly tout as strong points. Give each node a /64, and you're set.

I was replying in the context of the newbie who was asking for assistance. Your reply is rather tone-deaf in that regard.

You're arguing against the value of a "big, flat namespace", yet you're also arguing for IPv6, which itself is a big, flat namespace? Do you see the contradiction, perhaps?

Dedicated CIDR for pods is important because it's simple. The symmetry is simple to explain, simple to understand; the same simplicity you'd get from IPv6.

Moreover, it's an abstraction that can be implemented however you want (custom routing on L3, SDN overlay, BGP). Not everyone has a native L3 network. If you're on Google Cloud Platform, you get a virtual L3, but with other clouds, the networking is a bit more old hat. So again, simplicity and convenience. As for "overlay junk", the entirety of the Google Cloud itself is virtualized over what is probably the world's most sophisticated SDN overlay, so, well, some people's junk is other people's ragingly successful business, I suppose.

I'm not sure why you categorize the automatic iptables rules that Kubernetes sets up as slow or obfuscated. It's only magical in the sense that Kubernetes automatically makes its cluster IPs load-balanced, a convenient system that you are in no way forced to use. If you have a better setup, feel free to use it instead.

We use Kubernetes ingress. It works. It could be better, but it's not "garbage". I really recommend against putting everything in such categorical terms. Everything in your comment is "junk" and "garbage", and the people who designed it (Google!) are morons who don't understand networking, somehow. That kind of arrogance on HN just makes you look foolish.

I'm struggling to understand why you'd want to manually assign a /24 to each node. That seems very 1990s.

Can't each container be bound to a virtual network interface (macvlan) and use DHCP? That allows the network to configure and manage the address pool.

No fiddling with routing tables (well, not for each node), and it allows simple peering of VPCs.

/24 per node is one option, but not the only option. But that gives you max 254 pods per node.

The simplest option is to just use routing [1]. You don't have to use an SDN. Not sure if DHCP is one of the officially supported options.

I know there are people out there who use MACvlan/IPvlan. Some people discourage these types of virtualized networks because the packet manipulation can be inefficient (unless the NIC explicitly supports it; I believe some support VXLAN?) and can hamper the kernel's scheduling.

[1] https://medium.com/@rothgar/no-sdn-kubernetes-5a0cb32070dd

With respect, coordinating loads of route tables when it's a flat network is nothing short of ludicrous.

Firstly, _statically_ assigning an address range to each node is utter madness: it limits the containers you can have. Secondly, it's terribly inflexible; it's perfectly possible for a beefy server to have more than 254 containers running.

Thirdly, it ties up huge address ranges with _no_ flexibility. If you have nodes assigned to certain duties (like DB pods), then each can only realistically have a few containers, so the rest of its address range is wasted.

What is so frustrating is that all of this is automatically taken care of using DHCP and macvlan.

In the example that's linked, why isn't there a second adaptor on a different VLAN? That's a far simpler and more visible way of linking things together. I just don't see why you'd want to willingly fiddle with routing tables when on a normal flat network it's done for you, automatically.

> firstly it limits the containers you can have.

This is a config value; if you want more containers per node, use a /23 or a /22 instead. It's entirely up to the operator, there's nothing magical about the default choice of /24 (except for it being easier to perform arithmetic on).

> Thirdly it ties up a huge address ranges with _no_ flexibility.

If you're using 10/8, then you have 16 bits' worth of /24 subnets, so 65k nodes by default. It's true that there are some companies in the world that have to worry about this limit, but for almost everybody I don't think this is a real problem.
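The arithmetic behind those numbers, as a small sketch. The helper names are made up for illustration; the only assumption is the usual IPv4 convention that the network and broadcast addresses in a subnet are not handed out.

```python
import ipaddress

def usable_addresses(prefix_len):
    """Addresses in an IPv4 subnet, minus the network and broadcast addresses."""
    return 2 ** (32 - prefix_len) - 2

def node_subnet_count(cluster_cidr, node_prefix):
    """How many per-node subnets of the given size fit in the cluster CIDR."""
    net = ipaddress.ip_network(cluster_cidr)
    return 2 ** (node_prefix - net.prefixlen)

# Pods per node for a few slice sizes, and nodes per cluster for 10/8.
pods_per_node = {p: usable_addresses(p) for p in (24, 23, 22)}
nodes_with_slash24 = node_subnet_count("10.0.0.0/8", 24)
nodes_with_slash22 = node_subnet_count("10.0.0.0/8", 22)
```

So a /24 per node gives 254 addresses and 65,536 node slices, while moving to a /22 gives 1,022 addresses per node at the cost of dropping to 16,384 slices.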

>The ARP table might be bigger, but that's a different issue.

But this is the problem that most designs are trying to solve. Large L2s are notoriously fragile. 1,000 nodes, 50-100 pods/node is a lot of ARPs. And sometimes you want partitions between endpoints for security/isolation.

I agree with you about static assignment of addresses. But that's why (most) CNIs work with a controller of some kind for IPAM.

IMO, the problem complexity is hard to compress. You need to distribute/manage MAC addresses, routes, and/or state. Different designs would favor one over another.

Then you just move the routing problem to your gateway/router, and it'll end up exploding because of too many routes in the table (one per container), instead of only one per container host.
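A rough sketch of what I mean, with made-up numbers (1,000 hosts, 100 containers each) and hypothetical helper names. It only counts routing table entries under the two schemes; it says nothing about what a given router can actually handle.

```python
import ipaddress
from itertools import islice

def per_container_routes(node_subnet, pod_count):
    """One /32 host route per container, as a gateway would need without aggregation."""
    hosts = iter(ipaddress.ip_network(node_subnet).hosts())
    return [ipaddress.ip_network(f"{addr}/32") for addr in islice(hosts, pod_count)]

def table_size(nodes, pods_per_node, per_container):
    """Route count at the gateway: one per container, or one per container host."""
    return nodes * pods_per_node if per_container else nodes

# Illustrative numbers only.
host_routes = per_container_routes("10.0.1.0/24", 100)
flat_table = table_size(1000, 100, per_container=True)
aggregated_table = table_size(1000, 100, per_container=False)
```

With these numbers the per-container scheme needs 100,000 entries, versus 1,000 when each host's containers sit behind a single prefix.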

Or maybe I'm wrong. :)

If it's a flat network, then there is only one route. The ARP table might be bigger, but that's a different issue.

There is no difference between this and VM hosts.