Are there examples of high-utilization, large-scale Mesos deployments? Mesos didn't even gain over-commit until 2015, so it seems like it was generally behind the state of the art.
OK but what was the utilization? I'm not really sure K8s is state-of-the-art either. There are published research papers about very-large-scale clusters with 80%+ resource utilization.
In our production experience, utilization had far more to do with the service owners (or autoscalers/auto-tuners) correctly choosing the cgroups and CPU scheduler allocations, as well as the kernel settings for cgroup slicing and CPU scheduler. We had Mesos clusters with 3% utilization and have Kubernetes clusters with 95%+ utilization. But we also have Kubernetes clusters with <10% utilization.
To be fair, Kubernetes right now only schedules relatively small clusters. But it turns out that the majority of the world is not Facebook or Google and only needs relatively small clusters.
There must be some folks from Criteo lurking here. I'm an ex-Criteo'er and, if memory serves, we had something on the order of 10K nodes running Mesos/Marathon. We did all kinds of silly things to it, like running very CPU-intensive .NET/Core apps.
We also ran HiveServer2 and the Hive Metastore in Mesos, though that wasn't super CPU intensive (that was a pain, but mostly due to our Kerberos deployment).
The general use case of Mesos/Marathon always worked for us just fine (self-executable JVM apps), though there was plenty of Mesos hate at Criteo (and eventually Kubernetes spun up, though I left about a year ago and don't know its footprint).
PS, Hi Greg S! Hi Maxime B! <-- if you're reading :).
IIRC you could always overcommit in Mesos using DRF weights and accepting resource offers in your application. I could be wrong.
The larger point is that Mesos introduced a new, exciting way to do truly distributed allocation, where the cluster manager (i.e., Mesos) and the various applications coordinate and cooperate in how they use computing resources. In contrast, Kubernetes is centralized and pretty vanilla, and I would love to know what new ideas it has introduced (from an algorithmic and architectural perspective).
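For anyone who hasn't run into DRF (Dominant Resource Fairness, the policy Mesos's default allocator is based on), here is a rough sketch of the core idea: keep handing the next task's worth of resources to whichever framework currently has the lowest dominant share. This is not Mesos's allocator code. The function name, the framework names, and the cluster sizes are all made up for illustration, and real Mesos layers offers, declines, weights, and roles on top of this.

```python
from fractions import Fraction


def drf_allocate(capacity, demands):
    """Toy progressive-filling DRF (hypothetical helper, not a Mesos API).

    capacity: total cluster resources, e.g. {"cpu": 9, "mem": 18}
    demands:  per-task resource needs for each framework
    Returns the number of tasks each framework ends up running.
    """
    used = {r: 0 for r in capacity}
    tasks = {name: 0 for name in demands}
    active = set(demands)

    def dominant_share(name):
        # A framework's dominant share is its largest fractional use
        # of any single resource in the cluster.
        return max(
            Fraction(tasks[name] * demands[name][r], capacity[r])
            for r in capacity
        )

    while active:
        # Pick the framework with the lowest dominant share
        # (ties broken by name so the result is deterministic).
        name = min(sorted(active), key=dominant_share)
        need = demands[name]
        if all(used[r] + need[r] <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need[r]
            tasks[name] += 1
        else:
            # Its next task no longer fits; stop considering it.
            active.discard(name)
    return tasks


# Illustrative numbers only: a 9-CPU, 18 GB cluster shared by a
# memory-heavy framework "A" and a CPU-heavy framework "B".
result = drf_allocate(
    {"cpu": 9, "mem": 18},
    {"A": {"cpu": 1, "mem": 4}, "B": {"cpu": 3, "mem": 1}},
)
print(result)  # {'A': 3, 'B': 2}
```

Both frameworks end up with a dominant share of 2/3 (A on memory, B on CPU), which is the equalization DRF is after. The "distributed" part in Mesos is that the allocator only decides who gets the next offer; each framework's own scheduler decides whether to accept it and what to run.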
Twitter. From generic caches to ad serving, from stream processing to video encoding: all high-utilization applications of one or more schedulable resources.
These jobs had their allotted quotas, per team, giving them >70% utilization in their logical slice of the cluster. E.g., the video processing team gets 20,000 nodes globally. They stack (co-locate) their tasks (read: sets of processes) however they want.
Granted, Twitter operated one big shared Mesos+Aurora offering for everything*, and running the whole cluster at high utilization wouldn't leave much flexibility to absorb load or do reasonable capacity planning (which was an entire org in itself) when you own and operate those machines and data centers. I can't comment much on the 20-30% figure given at MesosCon; it's been more than 5 years since I was last privy to these figures.
I worked for Twitter up until 2017, and when I was there it was much higher than 20-30%, definitely >50%. It has quite possibly changed since then, but at least at that point in time Twitter was running Mesos on many thousands of machines.
Unfortunately the original article is lost, but here's a summary: https://daringfireball.net/linked/2015/04/29/siri-apache-mes...