by csdreamer7 13 hours ago
What specific issues did you have with OpenShift at high scale? How long ago was it?

Curious, since it is the truly open source one.

1 comment

It's a long story: OpenShift was a bad product all around until 2019 with v4, the 3rd rewrite, but that product was a home run. That in itself was an incredible turnaround, even before they moved away from OpenStack and turned OpenShift into a VM platform as well.

Mostly, the other problems are the typical problems of managing a bare-metal multi-tenant Kubernetes cluster. The customers that don't have as many of these problems are, ironically, running OpenShift on vSphere ;).

While the OCP operators and GUIs cover much of the usual day-to-day, you really need deep Kubernetes expertise at scale, and need to drop down to the upstream project code and docs. For example: it is very hard to force configuration discipline on tenants (leading to many flowers blooming here, like Kyverno); security in Kubernetes is complex and requires careful tradeoffs on policies; managing compute capacity and noisy neighbours is laborious and counterintuitive (requests vs. limits, i.e. you should always set requests and be very careful setting limits); Submariner and OVN-Kubernetes network services are limited compared to HCX+NSX (e.g. NAT topologies, distributed firewall management, tunnels, and fabric connectivity, i.e. VRFs or EVPN support, though this is coming soon; also, OpenShift's MetalLB for ingress load balancing is its own thing with its own connectivity config); out-of-the-box observability is not very good and requires 3rd-party solutions or extensively customized configurations; and the Kubernetes scheduler itself is focused on efficient bin packing rather than workload stability.
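To make the requests vs. limits point concrete, here is a rough Python sketch of the kind of check a policy engine would enforce on tenant workloads. It is not Kyverno's policy language or any OpenShift API. The function name `lint_container_resources` and the rule set are made up for illustration, and the pod spec is just a plain dict shaped like the `spec` section of a Pod manifest.

```python
from typing import Dict, List


def lint_container_resources(pod_spec: Dict) -> List[str]:
    """Return findings for a pod spec given as a plain dict.

    Hypothetical policy, for illustration only:
      * every container must set explicit cpu and memory requests
      * a cpu limit is not an error, but is flagged for review
    """
    findings: List[str] = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        resources = container.get("resources") or {}
        requests = resources.get("requests") or {}
        limits = resources.get("limits") or {}

        # Kubernetes defaults a missing request to the limit when only a
        # limit is set; this sketch still flags it so requests stay explicit.
        for resource in ("cpu", "memory"):
            if resource not in requests:
                findings.append(f"{name}: missing {resource} request")

        # CPU limits can throttle a container even when the node is idle,
        # which is why they deserve a second look.
        if "cpu" in limits:
            findings.append(f"{name}: cpu limit set; review for throttling")
    return findings


if __name__ == "__main__":
    pod = {
        "containers": [
            {"name": "api",
             "resources": {"requests": {"cpu": "500m", "memory": "256Mi"}}},
            {"name": "worker",
             "resources": {"limits": {"cpu": "1"}}},
        ]
    }
    for finding in lint_container_resources(pod):
        print(finding)
```

In a real cluster you would express this as an admission policy rather than a script, but the logic is about this small. The hard part is getting every tenant to accept it.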

Also, when replacing vSphere VMs with OSV, you lose DRS, which is a big blow. You do keep vMotion-equivalent live migration, but you must use a NetApp Filer (or any NFS store), Nutanix Files, or ODF/Ceph in RWX volume mode for your VMs. ODF/Ceph is more laborious to manage than vSAN (it requires its own expertise as well), but importantly it has native S3 object storage, which vSAN is still missing (though I hear it is imminent in VCF 9.1.2). VLAN assignment to VMs with NMState and multi-NIC failover has gotten better over the years with OCP, though it feels shakier (more complexity is exposed, LACP is required, etc.) than the VMware distributed switch's native load-based NIC teaming or NSX.

Overall, if you squint, OpenShift can replace much of vSphere on paper, and at least somewhat in practice, but you really, really need a sharp ops team that knows what they're doing and at least some 3rd-party solutions for capacity and observability. I'm also not sure Red Hat education and consulting is scaled to the level required to build these skills in industry quickly enough, though IBM certainly has the qualifications to do so. That said, Broadcom is also doing plenty to squeeze or shed its education and consulting to partners, which is usually a mixed bag at first, doesn't end well, and leads to repatriation.