by musk_micropenis 1294 days ago
30,000 users seems like a ludicrously small number of users to hit scaling problems. It sounds like Mastodon has not been designed for scale from the ground up, which is surprising for a project that hopes to be a popular social network.
10 comments

The number of users of an instance is not equal to the workload for the instance because of how federation works.

The frontend serves 30,000 users. The backend processes their posts alongside the posts of everyone that replies to them and all posts from people that the “home” users follow. So, while the home user base is 30,000, the effective load on the back end is much more, depending on how many followers/following a person has.
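
A toy sketch of that point, assuming a made-up follow graph (the account names and the `accounts_to_process` helper are hypothetical, and replies are ignored for brevity): the backend has to handle posts from every account that any home user follows, not just from the home users themselves.

```python
from typing import Dict, Set


def accounts_to_process(follows: Dict[str, Set[str]]) -> Set[str]:
    """Home users plus every account that any home user follows."""
    accounts = set(follows)  # the home users themselves
    for followed in follows.values():
        accounts |= followed  # local and remote accounts they follow
    return accounts


# Hypothetical two-user instance; keys are home users, values are who they follow.
follows = {
    "alice@home": {"bob@home", "carol@remote.example", "dave@other.example"},
    "bob@home": {"alice@home", "erin@remote.example"},
}

local = set(follows)
everyone = accounts_to_process(follows)
print(f"home users: {len(local)}, accounts whose posts are processed: {len(everyone)}")
```

Even in this tiny example the backend tracks more than twice as many accounts as it has home users, and the gap grows with how many remote accounts people follow.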

You can find posts from people who were hosting their own single user instance that went down because of a popular post that got federated across multiple instances.

> You can find posts from people who were hosting their own single user instance that went down because of a popular post that got federated across multiple instances.

Either I misunderstand something about it, or that still points to Mastodon being slow and badly engineered.

Like, the updates are per server, not per subscriber, right? That's at worst a few thousand requests, and all of them can essentially be served from RAM. Even if you get 10k responses to your comment, 10k responses within, say, 5 minutes is still only about 30-40 req/sec, which shouldn't be much even for Ruby.
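
The back-of-the-envelope rate can be checked with a few lines. This is only the arithmetic from the comment, assuming the 10k responses arrive evenly over the 5 minutes; the `requests_per_second` helper is a made-up name.

```python
def requests_per_second(total_requests: int, window_seconds: float) -> float:
    """Average request rate, assuming requests are spread evenly over the window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return total_requests / window_seconds


# 10k responses within 5 minutes, as in the comment above.
rate = requests_per_second(10_000, 5 * 60)
print(f"{rate:.1f} req/sec")
```

An average hides bursts, so the peak rate right after a post takes off could be noticeably higher than this figure.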

> You can find posts from people who were hosting their own single user instance that went down because of a popular post that got federated across multiple instances.

That doesn't give confidence to the typical self-hosting Mastodon user who goes 'viral' somehow. So they should expect that if one of their posts goes viral across instances, they have to become a sysadmin for the day to bring the instance back up and scale it to handle the traffic.

No wonder normal users are not self-hosting their own instances to fully own and self-verify themselves on Mastodon and have to search for an instance to re-centralize to.

That is very disappointing and not a great sell for Mastodon so-called 'verification', but not at all surprising.

So what I'm hearing is that the whole thing was designed as though any significant scaling was never really expected. That's not exactly encouraging for its future.
The largest Mastodon instance is mastodon.social, which had about 900K users as of 11/22. This article, together with the three linked to near the top, is gold for understanding what experienced (but new-to-Mastodon) system architects may go through as they grow a Mastodon instance from an interesting experiment that lives in their bedroom to a system that reliably supports tens of thousands of users.
The largest Mastodon instance is Truth Social.
The CTO of gab.com did a great breakdown on the scaling issues of ActivityPub federation: https://www.youtube.com/watch?v=3kDtZ8MBWy8
Of course. I mean, this doesn't come as a surprise. As others have already said, Mastodon is not designed to handle hundreds of thousands of users and scale up, which is why smaller volunteers are frequently running into issues with such a low number of users.

Mastodon is still a solution pretending to search for a problem, already reminding us of the failure of federated social networks in the long term and eventually withering away and re-centralizing with larger instances.

Mastodon was designed to be a system of federated instances, not necessarily a few instances scaled up to hundreds of thousands of users each. More people should be hosting themselves or joining smaller instances.
Mastodon isn't designed for massive instances, people are supposed to find smaller ones instead of centralizing. If you "scale", it goes against what Mastodon is even for.
it has not been designed for scale because a single instance temporarily couldn't scale up fast enough?
No, because a single instance needed to "scale up" at a very small user count.
Trollish usernames aren't allowed on HN because they subtly troll every thread they post to. I've therefore banned this account. If you don't want it to stay banned, we can rename it for you, as long as it's to something non-trollish.

https://hn.algolia.com/?sort=byDate&dateRange=all&type=comme...

It really needed to replace a defective set of SSDs. They migrated because they were already planning to, but, with a replacement set of disks, it could continue serving more users for longer and migrate on a more relaxed schedule.
You’re free to share your expertise in its repo or put some MRs
This here is the benefit of Mastodon. If Twitter sucks, I can't help make it better, but I am able to throw code at Mastodon (or insert ActivityPub software of choice).
Sometimes (like in this case) the scaling problem is "obtaining hardware".

Hardware having been obtained (via Hetzner, instead of in Kris Nova's home lab), the instance has scaled.

Before the scaling problems were hit, they were hosting on 4 very powerful machines with loads of RAM and all-SSD storage. I don't see any reasonable world in which those machines aren't enough to power 30,000 users.
The issue was the media storage. Mastodon stores lots of small files in deeply nested structures, so filesystems, especially networked filesystems, are very badly suited. Not to mention the disks themselves having issues.

The article quotes blocking requests for 10 to 20 seconds, this causes everything to go slower.

> Mastodon stores lots of small files in deeply nested structures

The deep nesting and overall layout of the cache drives me nuts. The storage layout is designed for the state of filesystems circa the late '90s, pre-ReiserFS or ext3, with the major deficiency of pretty much demanding an unnecessarily expensive object storage setup.

Most of the space on an instance like this goes to caching remote content, so the app ought not to make caching decisions itself but leave the actual caching to a proper cache. That would've been trivial if the paths of the cached objects were mapped to the media-proxy URLs it falls back to when the file has not already been downloaded. Instead a bunch of effort goes into building a cache setup that people often end up putting on expensive object storage, when it's data that shouldn't even need redundancy.

Eh, we served way bigger things, with millions of small files, via NFS just fine. Out of non-SSD storage too. Their issue, as written in the article, was SSDs failing.

It was a migration from one clustered filesystem to another, and the temporary step was just the Apache equivalent of try_files, with an NFS mount of the old filesystem alongside the new FS while the migration was running in the background.
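
For readers who haven't used try_files, the fallback idea looks roughly like this. It is a sketch in Python rather than the actual Apache setup, and the directory names and the `resolve` helper are made up for illustration: check the new filesystem first, then fall back to the old mount.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def resolve(relative_path: str, roots: Iterable[Path]) -> Optional[Path]:
    """Return the first existing file for relative_path, checking roots in order."""
    for root in roots:
        candidate = root / relative_path
        if candidate.is_file():
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    new_fs = Path(tmp) / "new_fs"    # hypothetical mount of the new filesystem
    old_nfs = Path(tmp) / "old_nfs"  # hypothetical NFS mount of the old one
    for root in (new_fs, old_nfs):
        (root / "media").mkdir(parents=True)

    # Not migrated yet: the file only exists on the old mount.
    (old_nfs / "media" / "a.png").write_bytes(b"old copy")
    found_before = resolve("media/a.png", [new_fs, old_nfs])
    served_from_old = found_before is not None and found_before.parent.parent == old_nfs

    # After the background migration copies it, the new filesystem wins.
    (new_fs / "media" / "a.png").write_bytes(b"migrated copy")
    found_after = resolve("media/a.png", [new_fs, old_nfs])
    served_from_new = found_after is not None and found_after.parent.parent == new_fs

    # A file present on neither mount resolves to nothing (a 404 in a web server).
    missing = resolve("media/missing.png", [new_fs, old_nfs])
```

A real deployment would also need to reject paths that escape the roots (for example `../`), which this sketch leaves out.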

If anything filesystems are better with some nesting vs "one big dir"...

I believe it was I/O bound, on old, failing SSDs. Even after migrating object storage out to DigitalOcean's S3-equivalent, Postgres couldn't keep up on the terrible disks.
Yeah, and take a look at the specs of the hardware that was hosting this thing. And the site fell over at 30k users??? Something is wrong with the software.
When you RTFA, you'll learn that the site didn't fall over because of the Mastodon software.
It’s very hard to scale when your persistent storage is failing and timing out.