|
|
|
|
|
by jitl
815 days ago
|
|
We went through something similar at Notion a few years ago and also chose to stick with RDS Postgres and build sharding logic in our application’s database client. In both our case and Figma’s, sharding Postgres ASAP was of critical importance because of transaction ID wraparound threat or other capacity issues that promise hard days-long downtime. The kind of downtime that costs 10s of millions of dollars of brand damage alone. Possibly even company ending. In such a situation, failure is not an option, and you must pick the least risky solution. Moving to an unmanaged cluster system and figuring out your own point-in-time backup/restore, access control provisioning, etc etc has a lot more unknown unknowns than sticking with the managed database vendor you know. The potential failure scenarios of Citus have scary worst cases - we get backup and restore wrong but it seems to work fine in test, then we move to Citus, then something breaks and we can’t restore from backup after all. It’s equally bad to mis-estimate the amount of time needed to bring up the new system. Let’s say you estimate 6 months to get parity with RDS built in features needed to survive disaster and start moving data over, but instead it takes 10 months. Is there enough time left to finish before going hard down? The clock is ticking. Staying with RDS keeps a whole class of new risk out of the picture. At least here at Notion, NO ONE wanted to build something complicated for fun. We really wanted the company we’d spent years working for and on-call for to survive. Our story: https://www.notion.so/blog/sharding-postgres-at-notion |
|