E.g. on AWS you might have all of a node's pod IPs on a bridge interface, then you talk to pods on other nodes thanks to VPC route table entries that the AWS cloud provider manages. NAT happens only when talking to the outside world or for traffic to Amazon DNS servers, which don't like source IP addresses other than those from the subnet they live in.
My memory is a touch fuzzy, but to route traffic out of a container in AWS, you have to either NAT thorough the instances network adapter, or attach an ENI to the container. However, you only get one ENI per vCPU in a VM (at least until Amazon finishes its custom NICs). What I'm really fuzzy on is whether the instance itself consumes one of those allocated ENIs.
That is, if you're running off a m4.2xLarge instance, you get a maximum of 8 ENIs - 8 containers if you want to use only VPC routing. For some services, this may be OK, but for many others (most?), it's far too few.
What's the destination? If it's the outside world, yes, you need NAT for state tracking and address rewriting, since the rest of the AWS infrastructure knows nothing about the pod CIDR (I guess you could set up a subnet for it and run a GW there).
For pod to pod, if you're OK with the limitations of 50 routes per VPC route table (you can open a ticket to bump that to 100, at the cost of some unspecified performance penalty), then you don't need NAT.
Otherwise, you can use something like Lyft's plugin, which does roughly what you describe. On a m4.2xlarge you only get 4 ENIs, but each of them can have 15 IPv4 and 15 IPv6 addresses, which the plugin manages. They assign the default ENI to the control plane (Kubelet and DaemonSets), so you should get 45 pods.
In my experience NAT is almost always involved in a Kubernetes setup (for on-prem).
The container network is generally not routable to the wider corporate WAN (it'll use RFC1918 addresses by default). You typically get one set of addresses for the main container network, a different set of addresses for the service IPs and then an routable set on the ingress.
What you describe is not NAT, the containers network segment is a separate network segment which is not accessible from outside the cluster, not directly and not through address translation. The ingress and service addresses are externally reachable addresses that expose services. NAT is not required for the setup.
That's inbound traffic coming from the outside world. You need NAT because the load balancer only knows about nodes, not individual pods (perhaps you can pull it off with e.g. ELBv2, but definitely not with v1).
There's more iptables magic if you talk to a service's virtual cluster IP, because of the load balancing, but from pod to pod, which is what I thought you were referring to, NAT is usually not involved.
Are you referring to the service cluster IPs? Those are great for short lived or low volume connections. If you want to balance load over long lived connections or have high volume, you really want to know the addresses of all your backends, whether that's done in your code or in a sidecar like Istio's.
A lot of it is due to an effort to make it work in as many environments with as few external dependencies (and environment control) as possible. The "simplest solution which could possibly work".
Personally, I'd rather just bring on ipv6. But, in my case, we don't have enough people who understand ipv6 (and it's barely supported in AWS) to use it ourselves.
Because that's the easiest thing to do when you don't know anything about networking. Ironically this also makes everything else much more complex and failure prone.
This is the answer surprisingly missing in the industry overall. It amazes me that I work with highly educated folks who cannot grasp some of the fundamental issues with k8s and the container ecosystem.
Because NATting encapsulates while routing doesn't? And encapsulation is the whole idea behind containers. Until everything is ready for IPv6 (lol, yeah right), NATting seems the only way to me.
https://kubernetes.io/docs/concepts/cluster-administration/n...
E.g. on AWS you might have all of a node's pod IPs on a bridge interface, then you talk to pods on other nodes thanks to VPC route table entries that the AWS cloud provider manages. NAT happens only when talking to the outside world or for traffic to Amazon DNS servers, which don't like source IP addresses other than those from the subnet they live in.