Hacker News new | ask | show | jobs
by avianlyric 832 days ago
Their rationale for this choice is covered in the article somewhat extensively near the top.

> Additionally, over the past few years, we’ve developed a lot of expertise on how to reliably and performantly run RDS Postgres in-house. While migrating, we would have had to rebuild our domain expertise from scratch. Given our very aggressive growth rate, we had only months of runway remaining. De-risking an entirely new storage layer and completing an end-to-end-migration of our most business-critical use cases would have been extremely risky on the necessary timeline. We favored known low-risk solutions over potentially easier options with much higher uncertainty, where we had less control over the outcome.

TL;DR they were working in a short timeline, with a limited team size, and wanted to minimise any risks to the business.

Clearly cost is an issue for Figma, but downtime, or worse data loss, would have a ginormous impact on their business and potential future growth. If your product is already profitable, your user base growing fast, and with your ARR. Why would risk that growth and future ARR just to save a few $10Ks a month? A very low risk DB migration that lets you keep scaling and raking in more money, is way better than a high risk migration that might save some cash in the long term, but also risks killing your primary business if it goes wrong.

1 comments

Ok, what risk? Cockroachdb is already proven technology and costs marginally more (if you use their serverless setup, it's free until you hit real scale). At the startups I've been at that hit scale, scaling sql was always a massive undertaking and affected product development on every single time.

If you don't want downtime, don't use databases that require downtime to do a migration?

Netflix, roblox, every single online gambling website all use cockroachdb.

Sounds like their discomfort was in the migration path to 'any other database' alongside not having the experience with another database to mitigate any unknown unknowns.

> During our evaluation, we explored CockroachDB, TiDB, Spanner, and Vitess. However, switching to any of these alternative databases would have required a complex data migration to ensure consistency and reliability across two different database stores.

> ensure consistency and reliability across two different database stores.

This is main known known. And this is hard thing to attain.

My favorite story on that is testing of tendermint consensus implementation [1]. The testing process found a way to break the consensus and the reason was that protocol implementation and KV store controlled by protocol used different databases.

[1] https://jepsen.io/analyses/tendermint-0-10-2

Never used cockroach so pardon my ignorance, but are there no operational challenges with running/using them? Or are they the same challenges? And how compatible is it from an application developer perspective?
The managed service is hassle free and it's auto sharded so you don't have traditional scaling issues. You do need to think about how your index choices spread writes and reads on the cluster to avoid hotspots. It's almost completely compatible with postgres wire protocol but it doesn't support things like extensions for the most part.
There are TONS of operational issues running cockroach. At the last company I was at cockroach was probably over used as a magical way to run multiple DCs and keep things consistent without high developer overhead, but it was #1 source of large outages. So much so that we’d run a cockroach segmented out for a single microservice to limit the blast radius when it eventually failed.

That and its comically more expensive than Postgres, if you think IOPs are expensive wait till you see the service contract.

CRDB is Postgres compliant so the wire protocol and SQL syntax is all Postgres. It should be a 1 to 1.
Are all the corresponding latencies for every query one to one too?
In ye olden times I used to stop bosses from throwing away the slowest machine we had, and try to get at least one faster machine.

It’s still somewhat the case, but at the time the world was rotten with concurrent code that only worked because an implicit invariant (almost) always held. One that was enforced by the relative time or latency involved with two competing tasks. Get new motherboards or storage or memory and that invariant goes from failing only when the exact right packet loss happens, to failing every day, or hour, or minute.

Yes, it’s a bug, but it wasn’t on your radar and the system was trucking along yesterday and now everything is on fire.

The people who know this think the parent is a very interesting question. The people who don’t, tend to think it’s a non sequitur.

This is the important question.

We evaluated several horizontally scalable DBs and Cockroach was by far the slowest for our access patterns.

Except for the un-implemented features which they might need.

It also uses serializable isolation and in their implementation reads are blocked by writes unlike in Postgres. Those are both significant changes that can have far reaching application impacts