| HN Mirror



	by jeffbee 1901 days ago
	Are there examples of high-utilization, large-scale Mesos deployments? Mesos didn't even gain over-commit until 2015, so it seems like it was generally behind the state of the art.

4 comments

dharmab 1901 days ago

Most famously, Siri runs (or used to run?) on a very large-scale Mesos deployment (tens of thousands of nodes, far more than Kubernetes can scale to).

Unfortunately the original article is lost, but here's a summary: https://daringfireball.net/linked/2015/04/29/siri-apache-mes...


thamer 1900 days ago

Wayback Machine has it[1], but there's not much more content than in Gruber's summary.

[1] https://web.archive.org/web/20150429225603/https://mesospher...


jeffbee 1901 days ago

OK but what was the utilization? I'm not really sure K8s is state-of-the-art either. There are published research papers about very-large-scale clusters with 80%+ resource utilization.


dharmab 1900 days ago

In our production experience, utilization had far more to do with the service owners (or autoscalers/auto-tuners) correctly choosing the cgroups and CPU scheduler allocations, as well as the kernel settings for cgroup slicing and CPU scheduler. We had Mesos clusters with 3% utilization and have Kubernetes clusters with 95%+ utilization. But we also have Kubernetes clusters with <10% utilization.


madhadron 1900 days ago

To be fair, Kubernetes right now only schedules relatively small clusters. But it turns out that the majority of the world is not Facebook or Google and only needs relatively small clusters.


davidopp__ 1895 days ago

> To be fair, Kubernetes right now only schedules relatively small clusters.

This is not really true:

https://news.ycombinator.com/item?id=25907312

https://cloud.google.com/blog/products/containers-kubernetes...

https://www.infoq.com/presentations/alibaba-kubernetes/


dharmab 1894 days ago

Even those numbers (10k to 15k nodes and 100k containers) are smaller than what a great Mesos framework was capable of.

Of course, this mattered to only a very small number of organizations.


madhadron 1888 days ago

Yes, 10k to 15k machines is a relatively small cluster in my world.


jqcoffey 1900 days ago

There must be some folks from Criteo lurking here. I'm an ex-Criteo'er and if memory serves we had something on the order of 10K nodes running Mesos/Marathon. We did all kinds of silly things to it, like running very CPU-intensive .NET/Core apps.

I dug this post up showing a service performing an internal auction of up to 530M advertising campaigns/sec on 88K CPUs in Mesos: https://medium.com/criteo-engineering/migrating-arbitrage-to...

We also ran HiveServer2 and the Hive Metastore in Mesos, though that wasn't super CPU intensive (that was a pain, but mostly due to our Kerberos deployment).

The general use case of Mesos/Marathon always worked for us just fine (self-executable JVM apps), though there was plenty of Mesos hate at Criteo (and eventually Kubernetes spun up, though I left about a year ago and don't know its footprint).

PS, Hi Greg S! Hi Maxime B! <-- if you're reading :).


ultimex 1899 days ago

Silly like wrapme.sh? https://imgflip.com/i/54trj2

Only a guy harassed and fired for developing in go and showing k8s capabilities ;)


kraemate 1900 days ago

IIRC you could always overcommit in Mesos using DRF weights and accepting resource offers in your application. I could be wrong.

The larger point is that Mesos introduced a new, exciting way to do truly distributed allocation (where the cluster manager (i.e., Mesos) and various applications coordinated and cooperated in how they use computing resources). In contrast, Kubernetes is centralized, pretty vanilla, and I would love to know what new ideas it has introduced (from an algorithmic and architecture perspective).
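For readers who haven't seen DRF (Dominant Resource Fairness): below is a minimal sketch of the progressive-filling idea, where the next task always goes to the framework with the lowest weighted dominant share. This is an illustration only, not Mesos's actual allocator code; the function name `drf_allocate`, the dict-based inputs, and the `weights` parameter are assumptions made for the example. Real Mesos sends resource offers that frameworks accept or decline, which this sketch collapses into a fixed per-task demand.

```python
from fractions import Fraction


def drf_allocate(capacity, demands, weights=None):
    """Hypothetical helper: hand out whole tasks by progressive filling.

    capacity: {resource: total units}, e.g. {"cpu": 9, "mem": 18}
    demands:  {framework: {resource: units needed per task}}
    weights:  optional {framework: weight}; a higher weight earns a larger share
    """
    weights = weights or {name: 1 for name in demands}
    used = {r: 0 for r in capacity}
    tasks = {name: 0 for name in demands}
    active = set(demands)

    def dominant_share(name):
        # Largest fraction of any single resource this framework holds,
        # scaled down by its weight.
        share = max(
            Fraction(tasks[name] * demands[name][r], capacity[r]) for r in capacity
        )
        return share / weights[name]

    while active:
        # sorted() makes tie-breaking deterministic (alphabetical).
        name = min(sorted(active), key=dominant_share)
        need = demands[name]
        if all(used[r] + need[r] <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need[r]
            tasks[name] += 1
        else:
            # This framework's next task no longer fits; stop offering to it.
            active.discard(name)
    return tasks


# 9 CPUs and 18 GB shared by two frameworks with different task shapes.
result = drf_allocate(
    {"cpu": 9, "mem": 18},
    {"A": {"cpu": 1, "mem": 4}, "B": {"cpu": 3, "mem": 1}},
)
print(result)
```

With these inputs A ends up with 3 tasks and B with 2, so both hold a dominant share of 2/3 (A of memory, B of CPU).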


bdd 1900 days ago

Twitter. From generic caches to ad serving; from stream processing to video encoding, all high utilization applications of either one or multiple schedulable resources.


jeffbee 1900 days ago

At MesosCon, Twitter said their cluster utilization was between 20 and 30%.


bdd 1900 days ago

These jobs had their allotted quotas, per team, giving them above 70% utilization in their logical slice of the cluster. E.g. the video processing team gets 20,000 nodes globally. They stack (co-locate) their tasks (read: sets of processes) however they want.

Given that Twitter operated one big shared Mesos+Aurora offering for everything*, high utilization across the whole cluster wouldn't leave much flexibility to absorb load or do reasonable capacity planning (which was an entire org in itself) when you own and operate those machines and data centers. I can't comment much on the 20-30% figure given at MesosCon; it's been more than 5 years since I was last privy to these figures.
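One way a 70%+ per-slice figure and a much lower whole-cluster figure can both hold is if a large share of the machines sits outside the busy slices as headroom. A rough sketch of that arithmetic follows; every number here is made up for illustration (the team names, node counts, and utilization values are hypothetical, not Twitter's real figures), and `cluster_utilization` is a helper invented for this example.

```python
def cluster_utilization(slices, total_nodes):
    """Whole-cluster utilization from per-team slices.

    slices: {team: (nodes_in_quota, utilization_within_quota)}
    total_nodes: every node in the cluster, including unallocated headroom
    """
    busy_nodes = sum(nodes * util for nodes, util in slices.values())
    return busy_nodes / total_nodes


# Hypothetical numbers: two teams that are busy within their own quotas...
slices = {"video": (20_000, 0.75), "ads": (10_000, 0.70)}

# ...inside a cluster where most nodes are kept in reserve.
overall = cluster_utilization(slices, total_nodes=80_000)
print(f"{overall:.1%}")
```

Here each slice runs at 70% or more of its own quota, yet the cluster as a whole comes out a little under 30%.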


streblo 1900 days ago

I worked for Twitter up until 2017 and when I was there it was much higher than 20-30%, definitely >50%. It's very possibly changed since then, but at least at that point in time Twitter was running Mesos on many thousands of machines.
