by musk_micropenis 1293 days ago
I would like to understand why Mastodon requires such a huge amount of hardware for mediocre traffic volumes. Not just the lazy "it's Rails" answer - I know Rails is a resource hog, but that doesn't go far enough to explain the extreme requirements here.

As a point of reference, look at what Stack Overflow runs on. As a caveat, SO is probably more read-heavy than Mastodon, but it also serves several orders of magnitude more volume (on a normal day in 2016 they would serve 209,420,973 HTTP requests[0]). They did this on 4 DB servers and 11 web servers. And in fact, it can serve (and has served) this volume of traffic on only a single server.
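For a sense of scale, here is a back-of-the-envelope conversion of that daily figure into an average request rate. It assumes traffic is spread evenly over the day and evenly across the 11 web servers, which real traffic is not, so peak rates would be higher.

```python
# Figures quoted in the comment above (a normal day in 2016).
REQUESTS_PER_DAY = 209_420_973
WEB_SERVERS = 11
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: perfectly uniform load over time and across servers.
avg_rps = REQUESTS_PER_DAY / SECONDS_PER_DAY
avg_rps_per_web_server = avg_rps / WEB_SERVERS

print(f"average requests/second: {avg_rps:.0f}")                 # about 2,424
print(f"average per web server:  {avg_rps_per_web_server:.0f}")  # about 220
```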

With this setup SO was not even close to maxing out their hardware (servers were under 10% load, approximately). SO also listed their server hardware[1] in 2016. I don't know enough about server hardware to assess the difference, but to my eye they look similar on the web tier with similar amounts of memory, similar disk, etc.

I'm not saying Hachyderm is doing anything wrong, but it makes me wonder if there's a fundamental problem with the design of Mastodon. And to be clear I understand that this particular issue was caused by a disk failure, but that they even had this hardware in place running Hachyderm is surprising to me.

[0] https://nickcraver.com/blog/2016/02/17/stack-overflow-the-ar...

[1] https://nickcraver.com/blog/2016/03/29/stack-overflow-the-ha...

5 comments

I would like to point out that you asked this same question a few days ago and got several answers:

https://news.ycombinator.com/item?id=33855686

I would recommend people read that thread before responding with the same answers.

Every post and every reply generates a request to all servers you are federating to: https://aeracode.org/2022/12/05/understanding-a-protocol/
This ultimately polynomial direct-connection approach is clearly fatally unscalable.

The fediverse needs to figure out a hub-and-spoke or supernode pattern so that service providers can scale up syncing, indexing, etc.

i.e., my personal instance should be able to offload most of the message passing to a supernode intermediary that lots of other instances use for federation, so that my instance only needs one connection and the supernodes only need to connect to each other and to their local networks.
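A rough sketch of how the link counts compare. This is a worst case that assumes every instance has followers on every other instance; the supernode count of 20 is a made-up number for illustration, and 17,000 instances is the figure used elsewhere in this thread.

```python
def direct_links(n_instances: int) -> int:
    """Worst case: every instance delivers directly to every other instance."""
    return n_instances * (n_instances - 1)


def supernode_links(n_instances: int, n_supernodes: int) -> int:
    """Each instance keeps one link to a supernode; supernodes fully mesh."""
    instance_links = n_instances
    supernode_mesh = n_supernodes * (n_supernodes - 1)
    return instance_links + supernode_mesh


N_INSTANCES = 17_000
N_SUPERNODES = 20  # hypothetical

full_mesh = direct_links(N_INSTANCES)
via_hubs = supernode_links(N_INSTANCES, N_SUPERNODES)

print(f"direct delivery links: {full_mesh:,}")  # 288,983,000
print(f"with supernodes:       {via_hubs:,}")   # 17,380
```

The direct case grows with the square of the instance count, while the supernode case grows roughly linearly as long as the number of supernodes stays small.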

They do have it for some things already. There is this concept of "relays", which you can use as a feed of the data from the larger instances. But AFAIK it's used only as a content source and it's not something that you can set up now as a way to help with scalability.

I am also closely following https://github.com/nostr-protocol/nostr to see how they get along, because I am growing weary of the "tech elite" that is moving to Mastodon and is pushing for "moderation by committee". I've already gotten into discussions with people who actually want server operators to open federation only to those that abide by some "Covenant". This seems rooted in good intentions, but it reeks of something that might lead to a corporate co-opting of a network which is supposed to be open.

> This ultimately polynomial direct-connection approach is clearly fatally unscalable.

Is it? I mean, let's assume you post something popular, and 17,000 servers request it. How many people do 17,000 servers cover? Even if we assume only 10 people per server, that's 170,000 people. How many people have 17,000 followers, never mind 170,000? And 10 per server seems implausibly low for an average.

17,000 hits is... not particularly notable from a server perspective, triply so when they're all requesting the same "just posted" item which is still cached.

Sure, if someone with multiple millions of followers is on your server you're going to have issues, but seriously, Twitter had issues with that too.

Also keep in mind that there's no algorithm pushing people towards the same "popular" posts. Things grow organically.

So, at what point are you suggesting that this becomes "fatal" and is that a point that anyone that isn't hosting literal superstars is going to encounter?

> I mean, let's assume you post something popular,

The issue is that there's no federated "popularity" metric. Every user that has followers on 17,000 servers has every single one of their posts pushed to all 17,000. Automatically and immediately, not on demand. Posts by users with only a few followers will occasionally go viral, but an outsize portion of the load is due to "whales".
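To put rough numbers on that, here is a small sketch of the push model in which every post is delivered once to every server that hosts a follower. The posting rates and the population of small accounts are invented for illustration; only the 17,000-server figure comes from this thread.

```python
from dataclasses import dataclass


@dataclass
class Account:
    name: str
    posts_per_day: int
    follower_servers: int  # distinct remote servers hosting at least one follower


def daily_deliveries(account: Account) -> int:
    """Push model: each post goes to each follower server, viewed or not."""
    return account.posts_per_day * account.follower_servers


# Hypothetical mix: one "whale" and 500 small accounts on the same instance.
whale = Account("whale", posts_per_day=20, follower_servers=17_000)
small_accounts = [
    Account(f"user{i}", posts_per_day=5, follower_servers=30) for i in range(500)
]

whale_load = daily_deliveries(whale)
small_load = sum(daily_deliveries(a) for a in small_accounts)
whale_share = whale_load / (whale_load + small_load)

print(f"whale deliveries/day:      {whale_load:,}")
print(f"all small accounts/day:    {small_load:,}")
print(f"whale share of deliveries: {whale_share:.0%}")
```

Under these made-up numbers a single account accounts for roughly four fifths of outbound deliveries, which is the "outsize portion" effect described above.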

I guess this goes some way to answering my thought,

> but it makes me wonder if there's a fundamental problem with the design of Mastodon.

I also note the article says,

> During the month of November we averaged 36.86 Mbps in traffic with samples taken every hour

That seems like a large amount of bandwidth to service 30,000 users (who knows what fraction of them are actually active at any given moment). But I guess there's going to be a lot of video and image content. I have tried searching all of their linked blog posts about scaling but can't find any number that might map to requests per second without making huge assumptions.
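What the quoted figure does allow is a rough total-transfer estimate. The sketch below assumes Mbps means 10^6 bits per second, that the hourly samples are representative of the whole month, and that all 30,000 users share the traffic equally, none of which the article confirms.

```python
AVG_MBPS = 36.86        # average quoted from the article
DAYS_IN_NOVEMBER = 30
USERS = 30_000          # user count mentioned above

# Assumption: 1 Mbps = 1,000,000 bits per second.
bytes_per_second = AVG_MBPS * 1_000_000 / 8
seconds_in_month = DAYS_IN_NOVEMBER * 24 * 60 * 60

total_bytes = bytes_per_second * seconds_in_month
total_tb = total_bytes / 1e12
mb_per_user_month = total_bytes / USERS / 1e6
mb_per_user_day = mb_per_user_month / DAYS_IN_NOVEMBER

print(f"total for the month: {total_tb:.2f} TB")
print(f"per user per month:  {mb_per_user_month:.0f} MB")
print(f"per user per day:    {mb_per_user_day:.1f} MB")
```

This comes out to a little under 12 TB for the month, or on the order of 13 MB per registered user per day. It says nothing about requests per second, since the average request size is unknown.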

> I would like to understand why Mastodon requires such a huge amount of hardware for mediocre traffic volumes.

There's some inherent overhead in a federated model (vs a single-source one), and the ActivityPub protocol Mastodon happens to use wasn't necessarily designed to be the lightest possible thing in all use-cases.

Also, there's just a lot more traffic. My instance said that, after Twitter's major struggles, they saw something like 30x more traffic and 20x more daily registrations. That's for instances that, prior to the influx, were run by volunteers in their spare time out of people's bedrooms or on small cheap VPSes and such.

These instances weren't necessarily ideally performance-tuned prior to the influx (and even if yours was, the remote ones your users might need to hit to fetch content from may not have been).

Interesting username you have there.

I don't see why you don't accept "it's Rails." There are other issues as sibling comments have pointed out, but by starting with an ecosystem known to have performance limitations, this sort of outcome is inevitable, is it not? I'm sure the Mastodon team were never expecting the degree of usage which has been thrust upon its larger instances, but now that it has happened and the limitations have become apparent, I'd encourage people who are interested in setting up fediverse/"Mastodon network" instances to consider the alternatives to Mastodon, however paltry they currently are.

I know that the Pleroma front end and its forks are written in something called Elixir, which I have no idea about but I can't imagine it could be much worse than Ruby. What I'd really like to see is something written in a language known to be actually fast, though - PHP or Lua.

I vouched for this comment because it's a good question, although I'm worried that your account is not long for this world with an inflammatory username like that.

The problem probably starts with the inefficiency of RoR, as you've guessed. Mastodon is a very dynamic site which limits the amount of caching that can be done, and there are hot code paths like filtering streams using a user's block lists and word filters that are not particularly optimized - all this happens in Ruby.
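To illustrate why that kind of filtering resists caching, here is a minimal sketch of per-viewer stream filtering. It is not Mastodon's code (which is Ruby); the `Post` and `Viewer` types and the substring matching rule are assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    author: str
    text: str


@dataclass
class Viewer:
    blocked_authors: set[str] = field(default_factory=set)
    filtered_words: set[str] = field(default_factory=set)


def visible_posts(stream: list[Post], viewer: Viewer) -> list[Post]:
    """Apply one viewer's block list and word filters to a stream of posts."""
    words = [w.lower() for w in viewer.filtered_words]
    result = []
    for post in stream:
        if post.author in viewer.blocked_authors:
            continue  # author is blocked by this viewer
        text = post.text.lower()
        if any(w in text for w in words):
            continue  # post contains a word this viewer filters out
        result.append(post)
    return result


stream = [
    Post("alice@a.example", "Hello fediverse"),
    Post("bob@b.example", "Spoilers for the finale"),
    Post("carol@c.example", "Lunch photo"),
]
viewer = Viewer(blocked_authors={"carol@c.example"}, filtered_words={"spoilers"})

shown = visible_posts(stream, viewer)
print([p.author for p in shown])
```

Because the result depends on each viewer's own lists, the same stream has to be filtered again for every viewer rather than rendered once and cached.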

But there are other inefficiencies, compared to SO:

1. Mastodon is a media heavy site, with a lot of uploading by users. Mastodon has to convert user-uploaded media to standardized representations (e.g. JPEG and h.264), which takes a lot of CPU time.

2. Mastodon has a "firehose" feed which is available in the UI and actually used by many users. Filters apply to the firehose feed as well. Obviously this requires quite a lot of bandwidth and processing.

3. Federation is a weakness when it comes to traffic. If user X has an account on server A, and at least one user on each of 1000 other instances follows user X, server A has to immediately send every post to all 1000 other instances, regardless of whether anyone on the other end will ever deliberately view it. (Of course, some users may view them in their instance's firehose feed.) Each receiving instance then has to duplicate this traffic when sending the post on to the users who are actually subscribed. By this standard both large non-federated "servers" (like Twitter) and widely federated pull-only servers (think RSS) are more efficient than ActivityPub (the open standard Mastodon uses).

4. Federation is a weakness when it comes to trust. Instances do not (and must not) fully trust each other, except for things like "@x@thisinstance said 'P'". So for example, the little Open Graph based preview cards you're used to seeing on Twitter and elsewhere have to be generated for links per instance. The first time a Mastodon server sees a link, it must fetch that link and generate a preview card itself. Because new posts by popular accounts are syndicated immediately, this is a burden on websites as well. https://www.jwz.org/blog/2022/11/mastodon-stampede/ (note: copy link or disable sending referrers from HN for this site)

5. Scaling is not really a solved problem yet for Mastodon, because in practice it hasn't had to be. It's easy to pass the buck to instance operators, who end up needing a $20/month VPS to run a small instance rather than $5/month. Even the very biggest servers are scarcely larger than 1M users. At that kind of scale you can patch over performance problems by just throwing more hardware at the problem - and e.g. mastodon.social has the funds from Mastodon (the org) to do that. Note that Hachyderm, AFAIK, is an obvious example of this; it was started by a tech worker in Seattle with much better access to expensive hardware than most casual instance operators can dream of. It's not surprising that they can pull the funds together to scale up before they start seeing performance issues.
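Point 4 in the list above can be sketched as a small simulation. The `Instance` class and `fetch_preview` function are hypothetical stand-ins, not Mastodon internals, and the count of 1000 instances is borrowed from point 3.

```python
from collections import Counter

# Counts how often each URL is fetched from the linked website.
origin_hits = Counter()


def fetch_preview(url: str) -> dict:
    """Stand-in for an HTTP fetch plus Open Graph parsing (hypothetical)."""
    origin_hits[url] += 1
    return {"url": url, "title": "placeholder title"}


class Instance:
    """Hypothetical receiving server with its own private preview-card cache."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.cards: dict[str, dict] = {}

    def receive_post(self, url: str) -> None:
        # An instance does not trust cards built elsewhere, so it builds its
        # own the first time it sees the link, then reuses it locally.
        if url not in self.cards:
            self.cards[url] = fetch_preview(url)


instances = [Instance(f"instance-{i}") for i in range(1000)]
link = "https://example.com/article"

for inst in instances:
    inst.receive_post(link)  # the original post arrives
    inst.receive_post(link)  # a boost of the same post arrives later

print(origin_hits[link])  # 1000: one fetch per instance despite local caching
```

Local caching keeps each instance from fetching twice, but it does nothing for the linked website, which still sees one request per instance in a short window.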

In practice #3 is the only one that matters. For reducing dynamism/increasing caching potential, it would be fairly easy to run a fork of the site with the more dynamic features excised (donate $1 a month to get access to dynamic features like filtering). For media transcoding, that's a textbook case of a CPU-bound operation that you could offload to an isolated Rust component for a CPU savings of 99% compared to Ruby (not an exaggeration). But the exponential nature of the network scaling will still kill you despite all this, and needs to be addressed at the protocol level ASAP.

The media transcoding components aren't written in Ruby; Mastodon shells out to ffmpeg or ImageMagick.
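The general shell-out pattern looks something like the sketch below. This is not Mastodon's actual invocation: the file names are placeholders and the choice of flags is an assumption, using only the standard ffmpeg options `-i` (input file) and `-c:v libx264` (H.264 video encoder).

```python
import subprocess


def build_transcode_command(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command line that re-encodes the video stream as H.264."""
    return ["ffmpeg", "-i", src, "-c:v", "libx264", dst]


def transcode(src: str, dst: str) -> None:
    # The encoding work happens in the ffmpeg child process, so the language
    # of the calling application has little effect on the CPU time spent.
    subprocess.run(build_transcode_command(src, dst), check=True)


# Only the command is built here; calling transcode() requires ffmpeg on PATH.
cmd = build_transcode_command("upload.mov", "standardized.mp4")
print(" ".join(cmd))
```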