by falcolas 3055 days ago
<rant>

And yet the number of people who can reliably install K8s from scratch, and who understand what's going on under the hood, remains remarkably close to 0.

How many people can, within a few hours, tell you how Kubernetes runs DNS, and how it routes packets between containers by default? How do you run an integrated DNS which uses, say, my_service.service.my_namespace instead of my_service.my_namespace?
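For reference, a rough sketch of the naming scheme that question is about. By default a Service resolves in-cluster as `<service>.<namespace>.svc.<cluster-domain>`, with `cluster.local` as the usual cluster domain. The helper name `service_fqdn` and the simplified label check are illustrative assumptions, not anything Kubernetes ships. Note that real names have to be DNS labels, so underscores as in `my_service` would be rejected.

```python
import re

# Simplified DNS label check (lowercase alphanumerics and '-', max 63 chars).
# This is an approximation for illustration, not the exact Kubernetes validator.
_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")


def service_fqdn(service, namespace, cluster_domain="cluster.local"):
    """Build the default in-cluster DNS name for a Service (hypothetical helper)."""
    for label in (service, namespace):
        if len(label) > 63 or not _LABEL.match(label):
            raise ValueError(f"not a valid DNS label: {label!r}")
    return f"{service}.{namespace}.svc.{cluster_domain}"


print(service_fqdn("my-service", "my-namespace"))
# my-service.my-namespace.svc.cluster.local
```

Changing the layout of that name (for example, putting `service` before the namespace) is a matter of the cluster DNS server's configuration, which is exactly the kind of thing the default tooling never makes you touch.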

I've found that most installs of k8s have been made using defaults, using tooling that Google has provided. We hired one such administrator, but when asked anything outside of how to run kubectl, they just shrugged and said "it never came up".

The codebase is vast, complicated, and there are few experts who live outside of Google. And it's getting more vast, more complicated on a quarterly basis.

It bothers me how far operations has gone from "providing reliable systems on which we run software" to "offload work onto the developer at any cost".

</rant>

I realize that a lot of this is because of scarcity. The good devops folks (i.e. those who are both competent generalist sysadmins and competent generalist programmers) are few and expensive. That makes pre-packaged "full stack" solutions like GAE, Kubernetes, and Fargate very appealing to leadership.

"You don't need an operations department to act as a huge drain on your revenue, just re-use your developers" holds a lot of appeal for those high up in the food chain. It's even initially appealing to developers! But in the end, it makes as much sense as re-using your developers to do customer service.

7 comments

This isn't a problem unique to Kubernetes; it's an issue across the industry in general. There are very few competent operations people, and you'd think they'd be in high demand, but in actuality operations groups are heavily mistreated compared to their software development peers.

I've abandoned operations as a career path and have now gone into product management, but I was an operations person for more than 12 years. In that time frame I learned very quickly that upper management considered the operations teams to be "system janitors" and that developers considered operations engineers to be their inferiors. The "move fast and break things" attitude is great sometimes, except it gives license to shortsightedness.

The reality is that operations is not a specialized skillset, in fact it's a generalized skillset made up of being a specialist in multiple facets of complex systems. There's simply not that many people out there who have that level of knowledge and understanding, and the industry has both perpetuated this problem by treating operations people terribly and worked around this problem by focusing on building stacks that require minimal operational overhead. Any good operations person could have been a software developer, but wanted to get beneath the abstraction layers. Instead, we get treated worse, paid less, and have less job demand despite being more competent. Most of the best ops people I've worked with ended up either leaving ops entirely, like myself, or becoming software developers to get a pay bump.

Luckily I got to work for a few decent companies along the way in my career that treated me well and I made a lot of life-long friendships with very smart people as well. So don't read the above as some deep complaint. It's just an observation of the reality that the incentives aren't there for smart and talented people to invest their energy in operations. I advise most of the young people passed my way to become software developers. They'll have more autonomy, get paid more, have higher job demand, and get treated better in general.

+1 on that.

Operations is the highly-skilled sucker who is awakened at 3am every day and never paid overtime. Don't be that guy.

I'm walking the same route. I'm good with both dev and ops.

But I've been working in devops for 3 years now, and I like it a lot, especially the automation part. What advice would you give me?

How many people understand how the Linux kernel works from top to bottom? There are more than a handful of cloud providers (AWS, Azure, Microsoft, Alibaba etc) that offer a completely managed Kubernetes experience. For most folks this will be good enough: you don't need to understand everything in order to take advantage of Kubernetes, similar to how you don't need to understand how the kernel (think POSIX) works: https://www.cncf.io/certification/software-conformance/
You're right. You don't have to know anything about Linux to run software on it... until you do. Until you have to understand and modify swap. Until you have to understand and change the various schedulers (for both processes and disk operations). Until you have to troubleshoot networking problems. Until you have to change a kernel setting to avoid a 0-day exploit. Until you have to encrypt all communication because a client said so.

Being on AWS or Azure or Microsoft doesn't shield you from these needs.

The job isn't typically to be an expert from day one, the job is to learn and develop as things come up. Field experience is how you, over time, build those skills.
If you're going distributed in the first place, doesn't that often imply a big team / big codebase?

In those cases, I'd argue things will likely come up quite quickly. Kubernetes is a key component of a platform, but not a PaaS, i.e. you are required to understand the low-level stuff, even if it's managed by a public cloud provider.

Cue out-of-memory apps, recovering GB+ JVM thread dumps out of a transient container, lack of troubleshooting tools, the kind of stuff falcolas said above, plus high pressure to resolve because it's a highly visible production app, and you've got a recipe for sadness.

Even at Google, AFAIK, K8s ran in the context of Borg/Borgmon and a host of other internal tools.

Most teams shouldn't install Kubernetes from scratch, but use a PaaS distribution like OpenShift, preferably with commercial support.

You need much more than Kubernetes: a secure (!) container registry, a container build system, deployment, log management, metrics...

It's fun to set up k8s from scratch, but there's little business value in reinventing the wheel all over again. Just like you wouldn't build your own Linux distro, you shouldn't do it with Kubernetes.

I've seen startups waste SO much time reinventing basic infrastructure instead of focusing on their product.

Honestly, I'm not even talking about startups here - it's established companies who have grown too big for the PaaS offerings, or who have specialized needs that PaaS providers don't offer, such as an HTTPS-enabled Redis cluster in AWS. That only recently started to become available, after years of us insisting on it.

Not to mention, the costs for PaaS providers don't scale up well (if they can even handle the load). They're great for startups on VC, but deadly for companies who want positive cash flow.

My question is this: why does the container world use NAT (3 layers to get out of a container to the base host in k8s) and not routing?

Is it just that the container devs don't know routing?

Kubernetes is the opposite. NAT is explicitly not required:

https://kubernetes.io/docs/concepts/cluster-administration/n...

E.g. on AWS you might have all of a node's pod IPs on a bridge interface, then you talk to pods on other nodes thanks to VPC route table entries that the AWS cloud provider manages. NAT happens only when talking to the outside world or for traffic to Amazon DNS servers, which don't like source IP addresses other than those from the subnet they live in.

My memory is a touch fuzzy, but to route traffic out of a container in AWS, you have to either NAT through the instance's network adapter, or attach an ENI to the container. However, you only get one ENI per vCPU in a VM (at least until Amazon finishes its custom NICs). What I'm really fuzzy on is whether the instance itself consumes one of those allocated ENIs.

That is, if you're running off an m4.2xlarge instance, you get a maximum of 8 ENIs - 8 containers if you want to use only VPC routing. For some services, this may be OK, but for many others (most?), it's far too few.

What's the destination? If it's the outside world, yes, you need NAT for state tracking and address rewriting, since the rest of the AWS infrastructure knows nothing about the pod CIDR (I guess you could set up a subnet for it and run a GW there).

For pod to pod, if you're OK with the limitations of 50 routes per VPC route table (you can open a ticket to bump that to 100, at the cost of some unspecified performance penalty), then you don't need NAT.
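A minimal sketch of what "one route per node" looks like, to make the route-table limit concrete. It assumes each node owns one pod subnet carved out of a cluster-wide CIDR; the CIDR `10.244.0.0/16`, the `/24` per node, the node names, and the function name `node_routes` are all made up for illustration, and the default limit of 50 is simply the figure quoted above.

```python
import ipaddress


def node_routes(cluster_cidr, node_ids, node_prefix=24, route_limit=50):
    """Return one (destination CIDR, next-hop node) route entry per node."""
    if len(node_ids) > route_limit:
        raise ValueError(
            f"{len(node_ids)} nodes need more than {route_limit} route entries"
        )
    # Carve the cluster CIDR into equal per-node pod subnets, in order.
    subnets = ipaddress.ip_network(cluster_cidr).subnets(new_prefix=node_prefix)
    return [(str(subnet), node) for subnet, node in zip(subnets, node_ids)]


routes = node_routes("10.244.0.0/16", ["node-a", "node-b", "node-c"])
for destination, next_hop in routes:
    print(destination, "->", next_hop)
```

With this layout the route table grows linearly with the node count, so the per-table limit is effectively a cap on cluster size unless you switch to another approach.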

Otherwise, you can use something like Lyft's plugin, which does roughly what you describe. On a m4.2xlarge you only get 4 ENIs, but each of them can have 15 IPv4 and 15 IPv6 addresses, which the plugin manages. They assign the default ENI to the control plane (Kubelet and DaemonSets), so you should get 45 pods.
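The arithmetic behind that 45, as a small sketch. It assumes one pod per IPv4 address and one ENI held back for the control plane, as described above; the function name `max_pods` is hypothetical and the instance figures are the ones quoted in this comment, not looked up.

```python
def max_pods(enis, ipv4_per_eni, reserved_enis=1):
    """Pods that fit if each pod takes one IPv4 address on a non-reserved ENI."""
    if reserved_enis > enis:
        raise ValueError("cannot reserve more ENIs than the instance has")
    return (enis - reserved_enis) * ipv4_per_eni


# m4.2xlarge figures as quoted above: 4 ENIs, 15 IPv4 addresses each.
print(max_pods(enis=4, ipv4_per_eni=15))  # 45
```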

AWS instances can do IP routing just fine. There is a flag to set when the instance is created or else it drops all traffic not from its own IP.
In my experience NAT is almost always involved in a Kubernetes setup (for on-prem).

The container network is generally not routable to the wider corporate WAN (it'll use RFC1918 addresses by default). You typically get one set of addresses for the main container network, a different set of addresses for the service IPs and then a routable set on the ingress.
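A quick way to check which of those address sets fall inside RFC1918 space, using only the standard library. The three example ranges are invented for illustration (the ingress one is a documentation range standing in for a routable block), and `is_rfc1918` is a hypothetical helper.

```python
import ipaddress

# The three private IPv4 blocks defined by RFC 1918.
RFC1918_BLOCKS = [
    ipaddress.ip_network(block)
    for block in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_rfc1918(cidr):
    """True if the whole network sits inside one of the RFC 1918 blocks."""
    network = ipaddress.ip_network(cidr)
    return network.version == 4 and any(
        network.subnet_of(block) for block in RFC1918_BLOCKS
    )


# Assumed example layout, not taken from any real cluster.
address_sets = {
    "pods": "10.244.0.0/16",
    "services": "10.96.0.0/12",
    "ingress": "203.0.113.0/24",
}
for name, cidr in address_sets.items():
    print(name, cidr, "private" if is_rfc1918(cidr) else "routable")
```

Anything reported as private here cannot be reached from the wider network without either a route being advertised for it or some form of address translation at the edge.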

What you describe is not NAT. The container network segment is a separate network segment which is not accessible from outside the cluster, neither directly nor through address translation. The ingress and service addresses are externally reachable addresses that expose services. NAT is not required for the setup.
If traffic flows from the pod network to an external network NAT is involved, as the Pod network is not routable.
I can see how it's more likely on prem, but at my job, we run Kubernetes in production on AWS and most traffic is pod to pod, without NAT involved.
That's inbound traffic coming from the outside world. You need NAT because the load balancer only knows about nodes, not individual pods (perhaps you can pull it off with e.g. ELBv2, but definitely not with v1).

There's more iptables magic if you talk to a service's virtual cluster IP, because of the load balancing, but from pod to pod, which is what I thought you were referring to, NAT is usually not involved.

No point in having a service you can't use :)
Are you referring to the service cluster IPs? Those are great for short lived or low volume connections. If you want to balance load over long lived connections or have high volume, you really want to know the addresses of all your backends, whether that's done in your code or in a sidecar like Istio's.
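A toy model of why that matters, under the assumption that a cluster IP picks a backend once per connection while a client or sidecar that knows every backend can spread individual requests. This is not how kube-proxy or Istio are implemented; the function names, backend addresses, and request counts are all invented for illustration.

```python
import random
from collections import Counter
from itertools import cycle


def pinned_per_connection(backends, connections, requests_per_connection, seed=0):
    """Pick a backend once per connection; all its requests land on that backend."""
    rng = random.Random(seed)
    load = Counter({backend: 0 for backend in backends})
    for _ in range(connections):
        load[rng.choice(backends)] += requests_per_connection
    return load


def balanced_per_request(backends, total_requests):
    """The caller knows all backends and rotates through them per request."""
    load = Counter({backend: 0 for backend in backends})
    rotation = cycle(backends)
    for _ in range(total_requests):
        load[next(rotation)] += 1
    return load


backends = ["10.244.0.5", "10.244.1.7", "10.244.2.9"]
print(pinned_per_connection(backends, connections=2, requests_per_connection=5000))
print(balanced_per_request(backends, total_requests=10000))
```

With two long-lived connections and three backends, at least one backend sits idle no matter how the connections are assigned, whereas per-request rotation keeps the backends within one request of each other.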
Look into Project Calico, they get it right: https://www.projectcalico.org/
A lot of it is due to an effort to make it work in as many environments with as few external dependencies (and environment control) as possible. The "simplest solution which could possibly work".

Personally, I'd rather just bring on ipv6. But, in my case, we don't have enough people who understand ipv6 (and it's barely supported in AWS) to use it ourselves.

Because that's the easiest thing to do when you don't know anything about networking. Ironically this also makes everything else much more complex and failure prone.
This is the answer surprisingly missing in the industry overall. It amazes me that I work with highly educated folks who cannot grasp some of the fundamental issues with k8s and the container ecosystem.
Because NATting encapsulates while routing doesn't? And encapsulation is the whole idea behind containers. Until everything is ready for IPv6 (lol, yeah right), NATting seems the only way to me.
You can encapsulate just fine with a routed architecture.

You still need NAT to talk to the outside world (your services are behind a load balancer either way).

Why do you need NAT ?
The reason why there are no people like that is that the vast majority of K8s adoption is driven by teams that are trying to mask their lack of understanding of systems (cloud or non-cloud).

Building containers that contain an entire operating system gives no wins. In fact, it adds an additional layer that will create issues, will break in a different way, etc.

The current love of modern orchestration systems among management is similar to the mid-nineties love of the "compute management packages" running on SGI that showed one "flying" through from one server to the other.

> I've found that most installs of k8s have been made using defaults, using tooling that Google has provided. We hired one such administrator, but when asked anything outside of how to run kubectl, they just shrugged and said "it never came up".

What is up with this? The last time I tried to learn kubernetes I couldn't find any information about how to set it up. Just some set up tools from google. I guess it is still like this? Is there really no one running kubernetes infrastructure with config management or anything?

The post you're replying to is absolute hyperbole. If you're hiring k8s guys who don't know etcd and the backend of k8s (we're not going to understand every single gear, I constantly forget how k8s garbage collects, I never have to interact with it) then you're not hiring Seniors who have worked on k8s for several years. That's no different from hiring a linux admin who only knows how to fix Cpanel. You made a bad hire or your budget wasn't high enough to attract experienced talent.

I'm one of the most frequent commenters on #kubernetes-users so I'm very aware of the questions and issues that come in from new k8s users and I'd say an absolutely massive majority of the users are running in baremetal via kubeadm/kops/etc. Typically on AWS (NOT EKS). The #gke channel is literally 1/10th the size of the #kubernetes-users channel.

If you have questions about k8s post in #kubernetes-users. The community is extremely helpful.

A LOT of people deploy K8s clusters via Terraform/Ansible, as well.

Why are professionals who know k8s back and forth less common? 2 years ago k8s was 1.1 and we had no idea where the market was going and if it would take off like it did. It takes time to build up the community and expertise. There are a LOT of very experienced k8s users nowadays whereas there were not 2 years ago. Finding someone with 2+ years of k8s experience who isn't a Xoogler is fairly rare right now because 2 years ago it wasn't the market behemoth that it is right now. I don't work with Google but I just happened to get involved with k8s almost 3 years ago. We are out there.

If you can't find an answer ping me @mikej and I'll try to get you going in the right direction.

You call it hyperbole, yet you just also verified that the scarcity is a real problem.

> If you're hiring k8s guys who don't know etcd and the backend of k8s [...] then you're not hiring Seniors who have worked on k8s for several years.

> Finding someone with 2+ years of k8s experience who isn't a Xoogler is fairly rare right now because 2 years ago it wasn't the market behemoth that it is right now.

Indeed, there aren't enough people who know how to run it, given how broadly it has spread and how much it's hyped.

If there's a thousand people out there who have that level of experience, I'd be surprised. And in an industry running hundreds of thousands of clusters (or more!), that's just too few people.

Yeah, I'm not buying that there are thousands of k8s "Seniors" out there, and they just aren't the people that are being hired. I've been in operations for 20 years, and I think I would qualify for what you would call "good devops" in that I am a generalist with a wide breadth of experience in systems as well as programming. Kubernetes is a beast. We run it from scratch, in production, and just keeping up with the changes since the project inception can be maddening.

I did much of the early research POCs for my company when the idea of containerization really took off, and my deployments would seem to not have a shelf life of more than a few days before I would have to conform to some new method they came up with. I was using Tectonic when it was first released and the documentation would change underneath me as I would try to set up the clusters. It's a LOT to keep up with.

I can understand and explain to someone every protocol or idea underlying Kubernetes, sure, because they build upon standards that we have all used before in operations. But to try to understand how it is all working together within Kubernetes, and then add in the complex interplay if you are like us and integrate with non-k8s systems that have complex firewall and routing rules now to allow the intercommunication... add in Calico or Flannel...Docker under the sheets with all its warts...it's a lot to manage. You need people that are engaged with the k8s project at a level that would normally be reserved for Googlers working on it.

Don't get me wrong... I like Kubernetes for the most part. I do agree that if you are planning to run in-house, you are in for some challenges, and that you will need a very high caliber of operations team to deploy and maintain it.

There are almost 24k people in the k8s.slack.io #kubernetes-users channel and it's only a small portion of the actual community. For instance I rarely see people from other huge K8s consumers, like Lyft, Zalando, Walmart, etc in the channel (high probability they just don't mention where they are or I never noticed).

I won't deny actual k8s experts are low in abundance right now. It's a complex platform in its near infancy. There are people brand new to k8s embarking on the journey to learn it every day in slack. Give them 6mo+ or another year and you'll have a few near experts and a ton of just generally experienced admins.

I don't feel as though it's any different from when I was working on my CCIE. When I was working on that there were only 18k other people out there who had CCIEs. I very rarely met one or even someone with just that level of skill (I'm not in the Bay Area/NYC). I had to ask questions through newsgroups and IRC and in IRC there were maybe 4-10 people at that level out of thousands. You could say there are thousands of networks out there that need a CCIE to run them but that isn't ever what happened; you'd have a CCIE basically lead from the top and their skill/experience would trickle down to lesser experienced netengs or they'd be brought in as consultants when necessary. I've worked with very few CCIEs. I see k8s going the same way. Every State will likely have a handful of experts on the subject while there will remain a ton of CCNA/CCNP level k8s admins and you just need to determine how complex your k8s infrastructure is and what level you'll need to hire to effectively manage it.

Brendan Burns's goal is to democratize ephemeral infrastructure so that anyone who can code can manage it. That's another topic entirely but the community is starting to output enough general guidance in the form of blogs, books, slack, et al. that hopping into the ecosystem now is basically a breeze compared to what it was when I got involved in the 1.x days.

Could just be that I'm a masochist and like learning painful things.

My biggest complaint about k8s right now is the lack of real-world production knowledge being distributed. A lot of people set up a cluster and leave it and never optimize it or make it actually production-ready. My goal is to significantly accelerate that through training, blogs, etc.

18k CCIEs is orders of magnitude more than the current number of Kubernetes experts.

It's also not comparable. A network is set up only once by a CCIE. It's done and it doesn't need touching for the next 10 years.

A Kubernetes cluster is set up once and then the troubles begin. Constant care needed every week for 10 years.

The first real CCIE certificate awarded was in 1993, #1025. 18k was something in like the year 2005 when I was in my early 20s.

Kubernetes was barely a blip on the map until 2017. There was Cisco routing and switching hardware LONG before there was even a CCIE certification.

This is a new ecosystem. Of course it takes time to develop experts.

A CCIE not touching a network for 10 years? That's ludicrous. I was a network engineer. I wish that was even remotely the case. I was constantly fighting firmware/bugs/etc. In fact I swore off Cisco and began working on my JNCIE at some point to stick with Juniper, which of course had its own issues.

I've run k8s 1.5 in production for 2 years on various clusters. That is almost a 2 year old release. I've had zero k8s specific problems. I recently migrated those clusters to 1.9 and apart from updating some API endpoints that changed over the major releases, and a lot of annotations that changed, it was very little actual work. It was mostly tedious "find & replace" work.
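To give a flavour of that "find & replace" work, here is a rough sketch that bumps `apiVersion` per manifest document based on its `kind`. The mapping is an assumption for illustration (workload kinds that gained `apps/v1` in 1.9), not a record of what was actually changed in that migration, and `bump_api_versions` is a hypothetical helper. A real upgrade usually needs more than the version string, so treat this as a starting point only.

```python
import re

# Assumed mapping for illustration: (kind, old apiVersion) -> new apiVersion.
UPGRADES = {
    ("Deployment", "extensions/v1beta1"): "apps/v1",
    ("Deployment", "apps/v1beta1"): "apps/v1",
    ("DaemonSet", "extensions/v1beta1"): "apps/v1",
}


def bump_api_versions(manifest_text):
    """Rewrite apiVersion in each YAML document, keyed on (kind, apiVersion)."""
    updated = []
    for doc in manifest_text.split("\n---\n"):
        kind = re.search(r"^kind:\s*(\S+)\s*$", doc, re.MULTILINE)
        api = re.search(r"^apiVersion:\s*(\S+)\s*$", doc, re.MULTILINE)
        if kind and api:
            new_version = UPGRADES.get((kind.group(1), api.group(1)))
            if new_version:
                # Splice the new version in place of the old one only.
                doc = doc[: api.start(1)] + new_version + doc[api.end(1):]
        updated.append(doc)
    return "\n---\n".join(updated)


manifest = """apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: web
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: web
"""
print(bump_api_versions(manifest))
```

Keying on the kind rather than doing a blind text replace matters here: in this example the Deployment is bumped while the Ingress, which shares the same old `apiVersion`, is left alone.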

I'm not going to bullshit people. K8s has its bugs, quirks and is complex, but there seem to be a huge number of people who run away in fear on HN.

It can be understood. I didn't even know how to use Docker before I jumped into learning k8s.

As someone who was raised in Operations but fully bought into the dev/ops kool-aid, I'd argue that most of the unhappiness I've felt in operations positions has been due to being the bottleneck in organizations with lots of development teams that depend upon our services. It is relieving this bottleneck, more than any technical benefit, that I think systems like Kubernetes provide. This doesn't really answer your "not many people know how to run Kubernetes" point, but I might argue that it is when the cost of managing the infrastructure beneath lots of different applications exceeds the cost of learning Kubernetes that one should make the switch. I think this is probably somewhere around 25+ development teams.