Could someone comment on why exactly this is a poor design for their backend? Genuinely curious, I don't have any real-world context on systems like this.
IMHO, at scale SQL will break down, even with the per-organization sharding that Slack is able to do. It's why we have great things like Cassandra and DynamoDB. They're designed to solve replication in an easier way than replicating an RDBMS, iff you know your data access patterns in advance and they're not ad hoc (ad hoc queries being what SQL is great at). This is the case for Slack. The typical way to solve RDBMS bottlenecks is to put a queue and messaging system in front of them. This breaks down when your services have bugs (my guess at what's happening).
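To make the "queue in front of the database" idea concrete, here's a minimal sketch, not anything Slack actually runs. It assumes an in-memory SQLite table and Python's in-process queue.Queue as stand-ins for a real RDBMS and a real message broker; the messages table and the drain helper are made-up names for illustration.

```python
import queue
import sqlite3


def drain(pending, conn, batch_size=100):
    """Move up to batch_size queued rows into the database in one transaction.

    Returns the number of rows written. Hypothetical helper for illustration.
    """
    batch = []
    while len(batch) < batch_size:
        try:
            batch.append(pending.get_nowait())
        except queue.Empty:
            break
    if batch:
        with conn:  # commits on success, rolls back if the insert raises
            conn.executemany(
                "INSERT INTO messages (channel, body) VALUES (?, ?)", batch
            )
    return len(batch)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE messages (channel TEXT, body TEXT)")

    # Producers enqueue writes instead of hitting the database directly.
    pending = queue.Queue()
    for i in range(250):
        pending.put(("general", f"msg {i}"))

    # A consumer drains the queue in batches, smoothing out write bursts.
    while drain(pending, conn):
        pass
    print(conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0])
```

Note that if the insert fails in this sketch, the batch has already been taken off the queue and is simply lost. That's the kind of consumer bug that makes this pattern fragile.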
is pretty good on why some NoSQL approaches are a step forward (perhaps not MongoDB at scale if consistency is necessary https://jepsen.io/analyses). In particular though:
There could be other reasons why Slack is slow. But at Slack scale, you need to be extremely deliberate about your database strategy, or you should follow the industry and use the built-in partition tolerance of Cassandra/DynamoDB. Key-value stores scale horizontally much more easily. B-trees don't scale as easily horizontally past a certain point.
Essentially, good NoSQL DBs have abstracted scale for you (so you don't have to think about it as much). But you have to know the access patterns in advance (the types of queries and updates you'll be running for most use cases), since you need to design your table around these access patterns. RDBMS leaks scaling from the abstraction (you need to use message queues, etc.).
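A rough sketch of what "design your table around the access patterns" means, assuming the two queries you care about are "latest N messages in a channel" and "messages in a channel since time T". This is a toy in-memory model of a partition key plus sort key layout, not the Cassandra or DynamoDB API; MessageTable and its methods are hypothetical names.

```python
from bisect import bisect_left, insort


class MessageTable:
    """Toy wide-row table: partition key -> rows kept sorted by sort key."""

    def __init__(self):
        self._partitions = {}

    def put(self, channel_id, sent_at, body):
        # channel_id is the partition key, sent_at is the sort key.
        insort(self._partitions.setdefault(channel_id, []), (sent_at, body))

    def latest(self, channel_id, limit):
        # Supported pattern 1: newest messages in one channel, newest first.
        if limit <= 0:
            return []
        rows = self._partitions.get(channel_id, [])
        return rows[-limit:][::-1]

    def since(self, channel_id, sent_at):
        # Supported pattern 2: messages in one channel at or after sent_at.
        rows = self._partitions.get(channel_id, [])
        return rows[bisect_left(rows, (sent_at,)):]


if __name__ == "__main__":
    table = MessageTable()
    table.put("c1", 3, "third")
    table.put("c1", 1, "first")
    table.put("c1", 2, "second")
    print(table.latest("c1", 2))
    print(table.since("c1", 2))
```

Both supported queries touch exactly one partition. Anything else (say, "all messages by a given user across channels") would need a full scan or a second table keyed differently, which is the tradeoff against ad hoc SQL.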
It’s proven itself for scaling. Mostly startups don’t see the issues with SQL and don’t need to worry about it. At planet/top-Alexa-ranked-website scale though, you either use Spanner at Google, or use Cassandra at Apple, DynamoDB at Amazon, Cassandra at Instagram, parts of Facebook, Netflix, Manhattan at Twitter etc.
You keep using MySQL at Github and Slack if you want periodic downtime/degradation in service tho.
The core of Facebook, Twitter, YouTube, LinkedIn, all of Microsoft and lots more all run on MySQL and other similar RDBMS servers. It's a myth that you cannot scale MySQL.
I'm not sure about the others (e.g., there's nothing recent for YouTube that I could find; likely they'd use Spanner for things like comments, though?).
I think you need a really talented database infra team if you're trying to use RDBMS for something like a real-time messaging store at scale. I can't say more about Amazon. But I just don't think it makes sense for these use cases to use RDBMS where real-time messaging (pull requests, comments, etc. for Github - messages, posts for Slack) is the 90% use case.
UPDATE. Maybe I'm wrong and you can use https://vitess.io/ for horizontally scaling MySQL. I don't know enough about the details of it. But getting the data store right is so important to the overall backend's stability (I think it's no coincidence that Twitter stability became "solved" when they moved to something like Manhattan). And I just don't see why you wouldn't rewrite things using something that logically makes a lot more sense instead of trying to push connection pooling, query rewriting, etc. to the limit. They don't fundamentally solve what consistent hashing solves.
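For anyone who hasn't seen consistent hashing, here's a minimal sketch of the idea next to plain hash-mod-N sharding. It's a toy ring with virtual nodes, not how Cassandra or DynamoDB actually implement it; the class and function names are made up, and MD5 is used only as a stable hash, not for security.

```python
import bisect
import hashlib


def stable_hash(key):
    # Python's built-in hash() is randomized per process, so use a digest.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def mod_shard(key, n_shards):
    # Naive sharding: changing n_shards remaps most keys.
    return stable_hash(key) % n_shards


class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (token, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node owns many small arcs of the ring (virtual nodes).
        for i in range(self.vnodes):
            bisect.insort(self._ring, (stable_hash(f"{node}#{i}"), node))

    def lookup(self, key):
        # The owner is the first token clockwise from the key's hash.
        idx = bisect.bisect_right(self._ring, (stable_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


def moved_fraction(keys, before, after):
    # Share of keys whose owner changes between two placement functions.
    return sum(before(k) != after(k) for k in keys) / len(keys)


if __name__ == "__main__":
    keys = [f"channel-{i}" for i in range(2000)]
    old = HashRing(["a", "b", "c", "d"])
    new = HashRing(["a", "b", "c", "d", "e"])
    print("ring:", moved_fraction(keys, old.lookup, new.lookup))
    print("mod: ", moved_fraction(
        keys, lambda k: mod_shard(k, 4), lambda k: mod_shard(k, 5)))
```

Going from 4 to 5 nodes, the ring should move roughly a fifth of the keys, and only onto the new node, while mod-N remaps most of them. That difference is what makes adding capacity cheap.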
Discord gets this too.[1] Not to say you can't have your user database in SQL like Facebook, etc. But for messaging? And really for anything with high throughput / low latency, where you know the access patterns, it just doesn't make sense not to use something with consistent hashing.
My original point - why I raised this to begin with - is that I briefly browsed that Slack CTO's video. At one point he mentioned using RDBMS "because that's what we're experienced with." That's never a logical reason. It may be a practical one. But with time... it just doesn't stand up to ideas that are better and have proven themselves (e.g., consistent hashing). But again, using MongoDB the wrong way or assuming the "document store" is the main reason for using NoSQL can confuse people (it's one nice benefit for ad hoc data models! but the big innovation in NoSQL is consistent hashing for the low-latency / high-throughput use cases). And SQL has its benefits for certain use cases. But there's a better solution for messaging storage at scale. I post this because: (a) I'm interested in others' opinions and feedback about how they've made RDBMS work (thanks) (b) I'm tired of Github and Slack being down periodically, and of each mature SaaS going through this learning curve (like with Twitter). Yo just use DynamoDB or Cassandra and save yourself the time/effort.
Don't have much to add, but want to say thanks for sharing those blog posts, they were interesting reads.
The Slack CTO's comment about choosing RDBMS because of 'familiarity' is interesting. IMO it's a gamble. I saw it play out at my company when we were latecomers to containerization.
When it came to picking a container management tool, it was a tossup between k8s, Nomad, or just saying to hell with it and running those containers ourselves on EC2 instances. Having run our stack on bare metal for years, we were really pretty good at it. There was a surprising amount of automation that could be ported over.
Eventually we picked k8s, and coincidentally, our usage grew more in 6 months than it had in the last 2ish years. So all in all, the gamble paid off.
... but I like to think there's another world where we picked the 'it's familiar' option and things still worked out. If our traffic hadn't grown the way it did, we would never have felt the pain of having to manually scale out our systems - or basically write an in-house version of Kubernetes.
So in that sense, I'd guess that maybe some teams have the bad habit of playing the same side of the coin every time. It may be prudent to stay conservative when picking a datastore, but maybe it would be smart to pick a risky technology for your app servers? (and vice versa)