by tasqa 2004 days ago
What I'd like to see for a change is someone actually doing the bare metal part itself. I've seen so many k8s showcase posts of this or that, but never someone who's running it on actual servers they own, without using any big four cloud APIs (I consider Equinix to be part of those too, soon...) to handle the LB/Ingress/network virtualization stuff they provide, and who still says it is easy to use.
8 comments

I managed a bare metal cluster of 5x 128 GB RAM for a fintech.

Using bare metal servers without VM layer is actually a simplification. Cutting out a layer that's not strictly necessary.

Test environments were in AWS. There is a load balancer outside of the cluster (highly available HAProxy as a service). I wouldn't say it's particularly difficult or easy. It's pretty cost effective. After the initial setup, scripting and testing are done, you spend at most a few hours per month on maintenance, and the difference in cost of servers is huge. Also, unmetered bandwidth.

The pain points are mostly storage (nothing beats redundant network storage à la EBS) and having to plan at least a few months in advance because you're renting larger chunks of HW.

Where do you see the difficulty?

I've installed k8s with ansible on baremetal (kubespray), more or less just followed the steps here: https://kubernetes.io/docs/setup/production-environment/tool...

No network virtualisation, just Calico. Announce the service IPs via BGP from each node running the service, and ECMP gives you a (poor man's) load balancing. Ingress gets such a service IP. I simply used nginx.

What's important here, though, is that the router needs to be able to do resilient hashing. Otherwise, removing or adding a node causes a rehash of all connections, which breaks them.

I guess you don't even realize how cryptic your post is for someone uninitiated :)

Calico? Network virtualization? BGP? ECMP? Resilient hashing?

No big surprise all this stuff is easy for you.

I was assuming that tasqa wanted to know how it works on bare metal in contrast to the cloud, and, since they brought up network virtualisation, that they were already knowledgeable about the networking part.

Networking is handled in kubernetes with CNI plugins, Calico is one of them. They define how one pod can talk to another.

How it does that is probably best described by the project itself: https://docs.projectcalico.org/about/about-networking

My simplified version: Calico uses the hosts' IP routing facilities to route IP packets to pods, either from another pod or from a gateway router.

BGP is a protocol to exchange routing information, so it can be used to inform the router or kubernetes nodes (in this case physical hosts) about where to send the IP packets.

If a pod is running on a node, the node announces with BGP that the pod IP can be routed over the IP of the node. If the pod provides a service (in the kubernetes sense), the node can also announce that the service IP can be routed over the same host. Now, if two pods on different nodes are providing the same service, then both nodes are announcing the same service IP. So there are multiple routes, or multiple paths, for the same IP. Those are the last two letters of the acronym ECMP (Equal-Cost Multi-Path). Equal cost, because we do not express a preference for one or the other.

The router then can make a decision where to send the packets to. Usually that is done by hashing some part of the IP packet (IP and port of source and target for example).
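
To make that concrete, here is a minimal Python sketch of per-flow hashing. It is only an illustration, not how any particular router does it: real routers compute this in hardware with their own hash functions. The node IPs, the service IP, and the function names are made up for the example, and SHA-256 is just a stand-in for the router's hash.

```python
import hashlib

# Hypothetical nodes that all announce the same service IP via BGP.
NEXT_HOPS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
SERVICE_IP = "10.96.0.50"  # hypothetical service IP


def flow_hash(src_ip, src_port, dst_ip, dst_port, proto="tcp"):
    """Deterministic hash over the flow tuple (stand-in for the router's hash)."""
    key = f"{proto}|{src_ip}|{src_port}|{dst_ip}|{dst_port}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def pick_next_hop(next_hops, src_ip, src_port, dst_ip, dst_port):
    """Pick one of the equal-cost next hops for this flow."""
    index = flow_hash(src_ip, src_port, dst_ip, dst_port) % len(next_hops)
    return next_hops[index]


# Every packet of the same TCP connection carries the same tuple,
# so it always lands on the same node.
first = pick_next_hop(NEXT_HOPS, "203.0.113.7", 51512, SERVICE_IP, 443)
second = pick_next_hop(NEXT_HOPS, "203.0.113.7", 51512, SERVICE_IP, 443)
print(first, second)
```

Different clients (or different source ports) hash to different values, so connections get spread across the nodes without the router keeping per-connection state.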

Now the question is: how does that hash decide which host a packet goes to? In most cases it is very simple: you have an array of hosts, and the hash modulo the array length gives you the host. The problem is that if you add or remove one item from that array, practically all future packets end up at a different host than before. The new host doesn't know what to do with them, which breaks the connection (in the case of TCP). Resilient hashing is a router feature that keeps that mapping as stable as possible when hosts are added or removed.
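
A rough Python sketch of the difference, under stated assumptions: the host names, flow IDs and the `ResilientTable` class are hypothetical, and the fixed bucket table is just one common way to get this behaviour, not a description of any specific router's implementation.

```python
import hashlib


def h(flow_id):
    """Deterministic hash of a flow identifier (stand-in for the router's hash)."""
    return int.from_bytes(hashlib.sha256(str(flow_id).encode()).digest()[:8], "big")


def modulo_pick(hosts, flow_id):
    """Naive ECMP: hash modulo the number of hosts."""
    return hosts[h(flow_id) % len(hosts)]


class ResilientTable:
    """Fixed-size bucket table; only buckets of a removed host get reassigned."""

    def __init__(self, hosts, buckets=256):
        self.table = [hosts[i % len(hosts)] for i in range(buckets)]

    def pick(self, flow_id):
        return self.table[h(flow_id) % len(self.table)]

    def remove(self, host):
        remaining = sorted(set(self.table) - {host})
        if not remaining:
            raise ValueError("cannot remove the last host")
        j = 0
        for i, owner in enumerate(self.table):
            if owner == host:
                self.table[i] = remaining[j % len(remaining)]
                j += 1


hosts = ["node-a", "node-b", "node-c", "node-d"]
flows = range(10000)
gone = "node-b"
survivors = [x for x in hosts if x != gone]

# Flows that were NOT on the removed host but still got moved.
before_mod = {f: modulo_pick(hosts, f) for f in flows}
moved_mod = sum(
    1 for f in flows
    if before_mod[f] != gone and modulo_pick(survivors, f) != before_mod[f]
)

table = ResilientTable(hosts)
before_res = {f: table.pick(f) for f in flows}
table.remove(gone)
moved_res = sum(
    1 for f in flows
    if before_res[f] != gone and table.pick(f) != before_res[f]
)

print("modulo: unaffected flows that moved:", moved_mod)
print("resilient: unaffected flows that moved:", moved_res)
```

With plain modulo, a large share of the connections that were on healthy nodes get sent somewhere else. With the bucket table, only the flows that were on the removed node are reassigned, and those were going to break anyway.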

You may enjoy the parts on ARP at the end of the post. I am planning a post on HA K3s with etcd on netbooted RPis. The netbooted RPi part is already available for free to my GitHub Sponsors. Here's a gist I put together whilst figuring out how it should look: https://gist.github.com/alexellis/09b708a8ddeeb1aa07ec276cd8...

Not sure what you mean re: "network virtualisation" though?

we are running k3s on metal in production. works great actually. we use haproxy as ingress and lb.
How stable do you find it? What's your cluster size? How long has it been running? Any tips for how to approach starting such a setup?
We scale up to about 100 machines. We use spot instances EXTENSIVELY, and that configuration was actually tricky. It's been a couple of months now. Works pretty OK.

k3s is actually pretty simple to use now. the tricky part was to integrate with https://github.com/kubernetes/cloud-provider-aws and https://github.com/DirectXMan12/k8s-prometheus-adapter

The hardest part is to get it to work with spot instances. we use https://github.com/AutoSpotting/AutoSpotting to integrate with it.

Wow! Impressive... what are the advantages over EKS? Just cost benefits or others as well?
We want to have the exact same stack running on laptops and cloud. That's the main goal. Everything else is secondary.
The HA control plane and service type LB are both handled by https://kube-vip.io, which is designed to be as transparent to the user as possible.
I recently set up a homelab of ~10 k3os+k3s nodes on NUCs. Setting up MetalLB on top of the base k3s installation made exposing services on their own IP addresses pretty dead simple.
I've seen a couple presentations by Chick-fil-a explaining things they've encountered doing this. You can find them on youtube, and here is an article: https://medium.com/@cfatechblog/bare-metal-k8s-clustering-at...
Same. I tried for my test environment server, but there were too many undocumented configuration steps. Eventually I just went with a single-node minikube and that was that. I'd love an article with all the kinks worked out.