by CommanderData 1294 days ago
Quite beefy hardware for on-prem. Perhaps someone could explain to me why 30k users, even assuming concurrent users would be an issue for hardware that size?

Is the app stack naturally resource heavy, or is this setup particularly different from how an instance should be?

There are two problems that come up with scaling:

1. Those 30k users are following users on other servers, which pulls in the content of those users. I'm on hachyderm but I would guess only about 20% of the people I follow are. That means my user is pulling 250 other users' data into the system. Of course most, if not all, of the people I follow are being followed by someone else, so it's not a pure multiplier. At the same time it does mean there's a lot more data moving in.

2. NFS, which is where they had the problems, was being used as a media store. People on mastodon and twitter like sharing images and other media. Even people who run single user nodes but follow a lot of people end up using a ton of storage space. 30k people scrolling through timelines and actively pulling that data out, while queues are pushing data in, can be tough to scale. Switching to an object store really helped fix that.

On top of that the mastodon app is very very sidekiq heavy. For those not familiar with ruby, sidekiq is basically a queue workload system (similar to python's celery). You scale up by having more queue workers running. The problem with NFS is that all of those queues are sharing the filesystem, making the filesystem a point of failure and scaling pain. Adding more queue workers makes the problem worse by adding to the filesystem load, rather than resolving the problems. Switching to an object store helps until the next centralized service (in this case postgres) reaches its limits.
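
To put a rough shape on that, here's a toy Python model of workers sharing one store. The function name and all the numbers are made up for illustration, not measurements from any real instance, and the model is optimistic: it only shows throughput flattening out, whereas real contention can make things actively worse.

```python
def jobs_per_second(workers: int, per_worker_rate: float,
                    storage_ops_limit: float, ops_per_job: float = 1.0) -> float:
    """Estimate queue throughput when every worker hits one shared filesystem.

    Hypothetical model: each job needs `ops_per_job` storage operations, and
    the shared store serves at most `storage_ops_limit` operations per second.
    """
    worker_bound = workers * per_worker_rate       # what the workers could do alone
    storage_bound = storage_ops_limit / ops_per_job  # what the shared store allows
    return min(worker_bound, storage_bound)


if __name__ == "__main__":
    # Doubling the workers stops helping once the shared store is saturated.
    for n in (4, 8, 16, 32):
        print(n, jobs_per_second(n, per_worker_rate=50, storage_ops_limit=600))
```

With these invented numbers, going from 16 to 32 workers buys nothing, because the store was already the limit at 16.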

So basically the 30k users each following their own set of users creates a multiplier on how many users the instance is actually working with. The more users on either side of the equation the more work that needs to be done. If this was a 30k user forum where every user existed on the instance the load would be significantly less.
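
A minimal sketch of that multiplier in Python, assuming handles look like "name@domain" (the function and the sample data are hypothetical). It counts the distinct remote accounts an instance ends up ingesting, which is smaller than the raw number of follows because of the overlap mentioned above.

```python
def unique_remote_accounts(follows_by_user: dict[str, set[str]],
                           local_domain: str) -> set[str]:
    """Return the distinct remote accounts the instance has to pull content from."""
    remote = set()
    for followed in follows_by_user.values():
        for account in followed:
            # Assumption: every handle is "name@domain".
            if account.rsplit("@", 1)[-1] != local_domain:
                remote.add(account)
    return remote


# Made-up example: two local users with overlapping follows.
follows = {
    "alice": {"bob@local.example", "carol@other.example", "dave@third.example"},
    "bob": {"carol@other.example", "erin@third.example"},
}
remote = unique_remote_accounts(follows, "local.example")
print(len(remote))  # 3 distinct remote accounts from 4 remote follows
```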

> The problem with NFS is that all of those queues are sharing the filesystem, making the filesystem a point of failure and scaling pain.

It is not NFS that is a SPOF, it is a single NFS server that is a SPOF. There exist distributed NFS systems (OneFS, Panasas) that can tolerate the loss of up to N servers before the service gets disrupted.

I suspect distributed NFS won't help here when the problem is that a server gets slow / overloaded. In particular, this setup was actually I/O bound and there wasn't more I/O to be had.

> In particular, this setup was actually I/O bound and there wasn't more I/O to be had.

No more I/O to be had from that particular NFS server. If your mounts are distributed over 4+ servers, then you have potentially 4x the operations available.

>Perhaps someone could explain to me why 30k users, even assuming concurrent users would be an issue for hardware that size?

The main problem was slow I/O because of faulty disks, which brought everything to a crawl.

Ruby on Rails... On top of that, the federation part is basically ingestion from all the federated servers across the world, so PostgreSQL would see constant writes even when the users of the instance aren't posting anything.

I'm absolutely amazed that people are going with federation of this form with this massive overhead instead of something like RSS, but beefed up to a standard, where you can just point your client to multiple web servers and have them use your server as an identity provider and that's it.

Imagine your subscriptions and account living on one server, then when you log in that server gives you the list and your client goes and gets all the data.
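
Roughly what that client could look like, as a Python sketch. Everything here is hypothetical: `load_timeline`, the `published` field, and the fake feeds stand in for whatever a real protocol would define, and the fetch function is injected so the example runs without a network.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def load_timeline(subscriptions: list[str],
                  fetch_feed: Callable[[str], list[dict]],
                  max_workers: int = 8) -> list[dict]:
    """Fetch every subscribed feed directly and merge the posts, newest first."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        feeds = list(pool.map(fetch_feed, subscriptions))
    posts = [post for feed in feeds for post in feed]
    return sorted(posts, key=lambda p: p["published"], reverse=True)


# Stand-in data; a real client would get the list from the identity server
# and do an HTTP request per feed.
FAKE_FEEDS = {
    "https://a.example/feed": [{"published": 2, "text": "from a"}],
    "https://b.example/feed": [
        {"published": 3, "text": "from b"},
        {"published": 1, "text": "older from b"},
    ],
}


def fake_fetch(url: str) -> list[dict]:
    return FAKE_FEEDS[url]


timeline = load_timeline(list(FAKE_FEEDS), fake_fetch)
```

The home server only hands over the subscription list; all the content requests go from the client straight to the publishing servers.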

We already had this sort of federation figured out, it's the open web. We just have to find a way to get the open web to provide the things that Google, Facebook, and Reddit provided.

Easy way to contribute content.

Discovery for new content.

Search ability.

Kill the things centralized websites provided, let people host websites within that system, and let the clients handle dealing with the fact that there are all sorts of providers out there.

> that server gives you the list and your client goes and gets all the data.

That would suck on a mobile device with limited bandwidth and/or data. Also lots of repetitive fetching when people have a large number of followers too, no?

> That would suck on a mobile device with limited bandwidth and/or data.

Why would it be any different than using a mobile feed reader to follow hundreds of blogs, or using a podcast app to follow hundreds of podcasts? Both are commonly done on mobile all the time. Checking if a feed has been modified since the last time the client checked costs virtually nothing.

bioemerl: I've been casually digging into Mastodon and ActivityPub for the last few weeks and FWIW I think you're absolutely right. With the caveat that I may be missing something obvious, it seems very dumb for Mastodon/ActivityPub servers to be downloading and delivering content on behalf of client apps.

> Why would it be any different than using a mobile feed reader to follow hundreds of blogs

In my experience, you generally do that through an aggregator which has already cached the articles for you and you're doing a bulk fetch from a single host - that is quite different to calling out to hundreds of individual hosts and fetching a page from each.

> it seems very dumb for Mastodon/ActivityPub servers to be downloading and delivering content on behalf of client apps.

Isn't that literally what an RSS aggregator does?

> In my experience, you generally do that through an aggregator…

My understanding is that a cloud-based aggregator (like Feedly) delivers feed and state information to clients, but not the content itself.

To test this with blogs, I did a new install of Reeder, synced with Feedly, then turned off Wi-Fi. In my subscriptions, I got everything that would be in the feeds themselves (notably item titles and descriptions as created by publishers) but nothing beyond that. The offline experience was mostly useless, suggesting that the client does most of the heavy-lifting even when leveraging cloud-based aggregators.

So is making a few REST API calls a significant savings over checking a couple hundred (or whatever) RSS feeds? With "If-Modified-Since" checks being so cheap, I'm not sure that inserting Mastodon instances as middleboxen makes sense. If all Mastodon did was store subscriptions and state info, it seems like we'd have a far more resilient microblogging ecosystem.
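
For anyone who hasn't seen one, a conditional fetch is only a few lines. This is a sketch using Python's standard `urllib`; the helper names are my own, and it assumes the server sends `Last-Modified` and honors `If-Modified-Since`, which not every feed host does.

```python
import urllib.error
import urllib.request
from typing import Optional


def conditional_request(url: str,
                        last_modified: Optional[str] = None) -> urllib.request.Request:
    """Build a GET that lets the server answer 304 Not Modified with no body."""
    headers = {}
    if last_modified:
        # Value of the Last-Modified header saved from the previous fetch.
        headers["If-Modified-Since"] = last_modified
    return urllib.request.Request(url, headers=headers)


def fetch_if_changed(req: urllib.request.Request) -> Optional[bytes]:
    """Return the new feed body, or None when the server says nothing changed."""
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib surfaces 304 as an HTTPError
            return None
        raise


req = conditional_request("https://blog.example/feed.xml",
                          "Wed, 21 Oct 2015 07:28:00 GMT")
```

When nothing has changed, the whole exchange is a request and a headers-only response, which is why polling a few hundred feeds this way is cheap.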

Is it really that big of a deal? We've advanced like 15 years since we were able to handle Twitter on our devices, why can't we handle many feeds from multiple servers?

That was solved with RSS readers.

It would be akin to having a single-user Mastodon instance.

RSS seems like a pretty bad standard that doesn't let you do a ton with it and doesn't solve problems like discoverability or your ability to post content.

It's great for reading from lots of news sites at the same time, but it's clear that it hasn't taken off, and when that happens it's normally for a pretty good reason.

I think they are referring to the way it works, not the technical details of the implementation, as it isn't really a "social network" if there's no ability to take any social action.

As in, you can subscribe to whoever you want and no moderator will stop you, and then you can... just comment on a blog post or article, with no third party involved aside from that particular space's moderators.

> That was solved with RSS readers.

Wasn't it solved by moving to RSS aggregators and doing a single bulk fetch of your cached N blogs rather than calling out individually to N blogs?

That works best when you have few clients and lots of “sources”, but Mastodon is apparently designed for few sources and lots of clients, in which case updating the “server” once could be more performant.

I hate to be that person, but I've seen complicated dynamic applications push much higher bandwidth and serve millions of concurrent users with similar if not smaller h/w requirements.

It would be interesting to see Twitter's complete backend, and while Mastodon might not be an apples-to-apples comparison, I'd also be interested in a cost-per-user infrastructure analysis.

Big difference being that Twitter uses Java, Scala, etc. Twitter used to use RoR also and it went down literally every day. I'm talking 2012 or so I think, bad memory haver here.

Twitter's primary problem was that they had not built a system that was designed to shard, not Rails. They'd have needed a rewrite no matter which framework they'd started with.

I have no love for Rails, but blaming it for Twitter's old problems is not fair.

That said, Mastodon has much of the same problem, and is only "saved" by the combination of federation and ten years of hardware advances. Thankfully, the federation means there's plenty of opportunity for people to experiment with other implementations of ActivityPub (or even implementations of the full Mastodon API), or fixes to it.

Rails doesn't scale.

And yet it runs sites orders of magnitude larger than Hachyderm just fine.

Looking at the number of requests my 5 user (following ~60 people in total), 90% idle Pleroma instance handles is a bit mad - 60-90 requests a minute.

I dread to think how much a busy 30k user instance does.

Pleroma also isn't RoR and is really designed to scale out better

> Pleroma also isn't RoR and is really designed to scale out better

Sure ... but that's not what I was talking about. It's still talking ActivityPub and that seems, to me, from my low use, few user instance, to be an extremely chatty protocol. I don't know if the ActivityPub traffic scales linearly with users but it would be a not inconsiderable number for 30k users.

Exactly. We're agreeing! That's why I'm not surprised that the Mastodon server didn't do so well and that you might get more mileage out of your compute.

My understanding is that Pleroma was first and foremost designed to have a small memory footprint for small instances. As an Elixir program running on the BEAM (Erlang) runtime it ought to scale a lot better, but serious work had to be put into that. Not sure it's running on any sites as big as the big Mastodon ones.

[1] has the biggest Pleroma instance at about 27k users - about 45th biggest according to the list on [2]

[1] https://the-federation.info/pleroma [2] https://instances.social

Bad disks.
And for a while, NFS on those bad disks.