Hacker News new | ask | show | jobs
by tedivm 1295 days ago
There are two problems that come up with scaling-

1. Those 30k users who are following users on other servers, which pulls in the content of those users. I'm on hachyderm but I would guess only about 20% of the people I follow are. That means my user is pulling in 250 other users data into the system. Of course most, if not all, of the people are follow are being followed by someone else so it's not a pure multiplier. At the same time it does mean it's a lot more data moving in.

2. NFS, which is where they had the problems, was being used as a media store. People on mastodon and twitter like sharing images and other media. Even people who run single user nodes but follow a lot of people end up using a ton of storage space. 30k people scrolling through timelines and actively pulling that data out, while queues are pushing data in, can be tough to scale. Switching to an object store really helped fix that.

On top of that the mastodon app is very very sidekiq heavy. For those not familiar with ruby, sidekiq is basically a queue workload system (similar to python's celery). You scale up by having more queue workers running. The problem with NFS is that all of those queues are sharing the filesystem, making the filesystem a point of failure and scaling pain. Adding more queue workers makes the problem worse by adding to the filesystem load, rather than resolving the problems. Switching to an object store helps until the next centralized service (in this case postgres) reaches its limits.

So basically the 30k users each following their own set of users creates a multiplier on how many users the instance is actually working with. The more users on either side of the equation the more work that needs to be done. If this was a 30k user forum where every user existed on the instance the load would be significantly less.

1 comments

> The problem with NFS is that all of those queues are sharing the filesystem, making the filesystem a point of failure and scaling pain.

It is not NFS that is a SPOF, it is a single NFS server that is a SPOF. There exists distributed NFS systems (OneFS, Panasas) that can tolerate the loss of up to N servers before the service gets disrupted.

I suspect distributed NFS won't help here when the problem is that a server gets slow / overloaded. In particular, this setup was actually I/O bound and there wasn't more I/O to be had.
> In particular, this setup was actually I/O bound and there wasn't more I/O to be had.

No more I/O to be had from that particular NFS server. If your mounts are distributed over 4+ servers, then you have potentially 4x the available operations available.