
by tmsh 2080 days ago

Just doesn't seem like it makes sense for any type of long-term messaging/posts/comments store.

I think you're mistaken about Twitter:

"Manhattan (the backend for Tweets, Direct Messages, Twitter accounts, and more)"

https://blog.twitter.com/engineering/en_us/topics/infrastruc...

I'm not sure about the others (e.g., I couldn't find anything recent for YouTube, though they'd likely use Spanner for things like comments?).

I think you need a really talented database infra team if you're trying to use an RDBMS for something like a real-time messaging store at scale. I can't say more about Amazon. But I just don't think an RDBMS makes sense where real-time messaging (pull requests, comments, etc. for GitHub; messages and posts for Slack) is the 90% use case.

UPDATE. Maybe I'm wrong and you can use https://vitess.io/ for horizontally scaling MySQL. I don't know enough about the details of it. But getting the data store right is so important to the overall backend's stability (I think it's no coincidence that Twitter stability became "solved" when they moved to something like Manhattan). And I just don't see why you wouldn't rewrite things using something that logically makes a lot more sense instead of trying to push connection pooling, query rewriting, etc. to the limit. Those techniques don't fundamentally solve what consistent hashing solves.

EDIT. Also, wrong about LinkedIn re: messaging:

https://engineering.linkedin.com/blog/2020/bootstrapping-our...

https://en.wikipedia.org/wiki/Voldemort_(distributed_data_st...

https://en.wikipedia.org/wiki/Consistent_hashing

Discord gets this too.[1] That's not to say you can't have your user database in SQL like Facebook, etc. But for messaging? And really for anything with high throughput / low latency, where you know the access patterns, it just doesn't make sense not to use something with consistent hashing.
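
To make the consistent hashing point concrete, here is a minimal Python sketch. It is a toy illustration, not how Manhattan, Voldemort, Cassandra, or DynamoDB actually implement partitioning. The `HashRing` class, the node names, the `channel-N` keys, and the choice of MD5 with 100 virtual nodes per server are all assumptions made for the example. It counts how many keys change owner when a fifth node joins a four-node cluster, and compares that with naive `hash(key) % node_count` sharding.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (a 64-bit integer)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """A toy consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node is placed on the ring at `vnodes` pseudo-random positions.
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def lookup(self, key: str) -> str:
        # The owner is the first ring position clockwise from the key's hash,
        # wrapping around to the start of the ring if needed.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


keys = [f"channel-{i}" for i in range(10_000)]

# Consistent hashing: grow the cluster from 4 to 5 nodes.
ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
before = {k: ring.lookup(k) for k in keys}
ring.add("node-e")
after = {k: ring.lookup(k) for k in keys}
moved_ring = sum(before[k] != after[k] for k in keys)

# Naive modulo sharding for comparison: shard = hash(key) % node_count.
moved_mod = sum(_hash(k) % 4 != _hash(k) % 5 for k in keys)

print(f"consistent hashing moved {moved_ring / len(keys):.0%} of keys")
print(f"modulo sharding moved    {moved_mod / len(keys):.0%} of keys")
```

In expectation the ring moves about one fifth of the keys (only the ones the new node takes over), while modulo sharding reassigns about four fifths. That difference is why adding capacity to a ring-partitioned store does not force a near-total reshuffle of the data.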

But as mentioned in Rick's AWS videos about https://en.wikipedia.org/wiki/Technology_adoption_life_cycle - there's a lot of late majority, laggards, etc.

My original point - why I raised this to begin with - is that I briefly browsed that Slack CTO's video. At one point he mentioned using RDBMS "because that's what we're experienced with." That's never a logical reason. It may be a practical one. But with time... it just doesn't stand up to ideas that are better and have proven themselves (e.g., consistent hashing).

But again, using MongoDB the wrong way, or assuming the "document store" is the main reason for using NoSQL, can confuse people (it's one nice benefit for ad-hoc data models! but the big innovation in NoSQL is consistent hashing for the low-latency / high-throughput use cases). And SQL has its benefits for certain use cases. But there's a better solution for message storage at scale.

I post this because: (a) I'm interested in others' opinions and feedback about how they've made RDBMS work (thanks), and (b) I'm tired of GitHub and Slack being down periodically, and of each mature SaaS going through this learning curve (like Twitter did). Just use DynamoDB or Cassandra and save yourself the time/effort.

[1] https://blog.discord.com/how-discord-stores-billions-of-mess...

1 comment

jsmith12673 2076 days ago

Don't have much to add, but want to say thanks for sharing those blog posts, they were interesting reads.

The Slack CTO's comment about choosing RDBMS because of 'familiarity' is interesting. IMO it's a gamble. I've seen it happen at my company, when we were latecomers to containerization.

When it came to picking a container management tool, it was a tossup between k8s, Nomad, or just saying to hell with it and running those containers ourselves on EC2 instances. Having run our stack on bare metal for years, we were really pretty good at it. There was a surprising amount of automation that could be ported over.

Eventually we picked k8s, and coincidentally, our usage grew more in 6 months than it had in the last 2ish years. So all in all, the gamble paid off.

... but I like to think there's another world where we picked the 'it's familiar' option and things still worked out. If our traffic hadn't grown the way it did, we would never have felt the pain of having to manually scale out our systems - or of basically writing an in-house version of Kubernetes.

So in that sense, I'd guess that maybe some teams have the bad habit of playing the same side of the coin every time. It may be prudent to stay conservative when picking a datastore, but maybe it would be smart to pick a risky technology for your app servers? (and vice-versa)