by seanhunter 813 days ago
The answer is obvious: they invented their own sharding solution because it's a really really cool problem to work on and they have more engineers than they really need to develop their actual product. A more resource-constrained team would have found a solution that sharded their backend using one of the existing solutions out there.

I have seen this several times before and it's always a symptom of having too many engineers working below the waterline. Rather than work on the actual customer-facing problem, let's port the backend to do event-sourcing/cqrs, move all our infrastructure to k8s, change language from x to y etc.

These are all what I would call "internal goals" (i.e. they may or may not be necessary or even essential to progress, but they are not directly customer-visible in their outcomes, even if they may enable customer features to be built or indirectly improve the customer experience later) and need to be held to an extremely high level of scrutiny.

If you're amazon/google/meta and you need to do this because of extreme user scale I might believe you. If you're CERN or someone and you need to do this because of absolutely ridiculous data scale I might believe you. The idea that it's better for figma to write their own sharding solution than it is to port to one of the existing ones just doesn't pass even the most basic sniff test.

I can buy your comment as an interesting and even credible hypothesis, but the absolutes which you deal in (“doesn’t pass even the most basic sniff test”) are damning. You are clearly lacking huge amounts of information and context and are passing your own assumptions as hard facts.

Also, I’m assuming Amazon or Google will sometimes roll their own solutions on problems of a scale in the same ballpark as Figma’s.

But anyhow, what’s the scale at which this becomes acceptable, exactly? Is there a magical number which serves as a universal threshold? Or is there - like in all engineering decisions - a very concrete economic case for which you and I both lack a lot of the requisite context and inputs?

In this particular case of sharding PostgreSQL, in my opinion, the parent is right. Any major cloud provider would give companies of their scale assistance. This is their bread and butter. The post likely hides a requirement to stay on AWS, but we don’t know, because they did not talk about that. Likewise, Cockroach or Yugabyte were also available options.
I like the approach you took for questioning an unqualified claim.

Seems like a useful argument design pattern.

We went through something similar at Notion a few years ago and also chose to stick with RDS Postgres and build sharding logic in our application’s database client.

In both our case and Figma’s, sharding Postgres ASAP was of critical importance because of the threat of transaction ID wraparound or other capacity issues that promise hard, days-long downtime. The kind of downtime that costs tens of millions of dollars in brand damage alone. Possibly even company-ending.
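
For readers who haven't run into wraparound: PostgreSQL transaction IDs are 32-bit counters, so a row can only be compared against roughly 2^31 (about 2.1 billion) transactions before old rows must be frozen by vacuum. A rough way to reason about "how long do we have" is sketched below. The input numbers are made up for illustration; on a real system the age would come from something like `SELECT age(datfrozenxid) FROM pg_database`, and PostgreSQL stops accepting new transactions somewhat before the hard limit, so treat the result as an optimistic upper bound.

```python
# Back-of-the-envelope estimate of time left before transaction ID wraparound.
# Assumption: the age and consumption rate used below are invented inputs.

XID_HORIZON = 2**31  # about 2.1 billion transactions


def days_until_wraparound(oldest_xid_age: int, xids_per_day: int) -> float:
    """Days of headroom left if vacuum never manages to freeze old rows."""
    if xids_per_day <= 0:
        raise ValueError("xids_per_day must be positive")
    remaining = max(XID_HORIZON - oldest_xid_age, 0)
    return remaining / xids_per_day


# Hypothetical: the oldest unfrozen XID is 1.5 billion transactions old and
# the database consumes 5 million XIDs per day.
headroom = days_until_wraparound(1_500_000_000, 5_000_000)
print(f"about {headroom:.0f} days of headroom")
```

If autovacuum on the largest tables stalls or cannot keep up, that headroom only shrinks, which is why the deadline feels so hard.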

In such a situation, failure is not an option, and you must pick the least risky solution. Moving to an unmanaged cluster system and figuring out your own point-in-time backup/restore, access control provisioning, etc. has a lot more unknown unknowns than sticking with the managed database vendor you know. The potential failure scenarios of Citus have scary worst cases: we get backup and restore wrong but it seems to work fine in test, then we move to Citus, then something breaks and we can’t restore from backup after all. It’s equally bad to mis-estimate the amount of time needed to bring up the new system. Let’s say you estimate 6 months to reach parity with the built-in RDS features needed to survive disaster and start moving data over, but instead it takes 10 months. Is there enough time left to finish before going hard down? The clock is ticking. Staying with RDS keeps a whole class of new risk out of the picture.

At least here at Notion, NO ONE wanted to build something complicated for fun. We really wanted the company we’d spent years working for and on-call for to survive.

Our story: https://www.notion.so/blog/sharding-postgres-at-notion
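
For anyone unfamiliar with what "sharding logic in the database client" means in practice, here is a minimal sketch. It is not Notion's or Figma's actual code. The shard count, hostnames, schema naming, and the choice of a workspace ID as the shard key are all assumptions made for illustration.

```python
import hashlib

# All constants and hostnames below are hypothetical.
NUM_LOGICAL_SHARDS = 64
PHYSICAL_HOSTS = ["pg-0.internal", "pg-1.internal", "pg-2.internal", "pg-3.internal"]


def logical_shard(workspace_id: str) -> int:
    """Map the shard key to a logical shard using a stable hash."""
    digest = hashlib.sha256(workspace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_LOGICAL_SHARDS


def route(workspace_id: str) -> tuple[str, str]:
    """Return the (host, schema) that holds every row for this workspace."""
    shard = logical_shard(workspace_id)
    shards_per_host = NUM_LOGICAL_SHARDS // len(PHYSICAL_HOSTS)
    host = PHYSICAL_HOSTS[shard // shards_per_host]
    return host, f"shard_{shard:03d}"


host, schema = route("workspace-1234")
print(host, schema)
```

Keeping many more logical shards than physical hosts is a common design choice: rebalancing later means moving whole schemas between hosts rather than re-hashing individual rows.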

Or you could just hire some set of people who know how to manage Postgres? Seems easier than building an entirely new thing with its own set of unknown-unknown bugs, and the brand damage that comes with them, awaiting you.
It's not just managing Postgres, it's managing a Citus cluster. Unmanaged Postgres + Postgres experts + time for them to implement their stuff just gets us to parity with RDS but doesn't solve our sharding problem. We asked our Postgres consultants & networks to see if we could find Citus experts we could bring on full-time but didn't have great success. Most of the experts we talked to suggested application-level sharding, and it seems like it worked out okay.
Absolutely, I am just saying that you are now taking on all the inconsistencies of a third-party management system and building your solution on top of that; you don't get the infra savings and benefits of managing your own. You gain some velocity for now and, as big-name clients, will probably be stable for a few years.

I had a problem just recently where I worked at a place that's using blue/green AWS RDS deployments with MySQL replication, and binlogs can't be moved in that service.

This is something that is bog simple in a non-managed service, and as a result we can either manage app replication, re-sync data on each b/g upgrade, or do physical replication (slow). My point isn't that RDS is bad, it's just that if you are already deciding to implement your own significant infrastructure on top of a database, it seems weird to me to not just have the knobs on the thing itself.

Though you could say the same is true of the storage, and tbqh most of the cloud storage is dogshit these days but we just deal with it.

It seems that these days the art of configuring a database has long been lost. I also completely don't understand the issue. Just buy two huge behemoth servers, put your Postgres there in replicated mode and move on. It'll sustain huge load. Surely those companies can afford to hire one sysadmin.
You can’t necessarily play Cookie Clicker with database hardware scaling and have a good time. Query performance and upkeep processes often begin to degrade well before a table reaches the maximum hardware-bound size. We were using an instance with 96 cores and 350 GB of RAM, which seems over-provisioned on paper, and were still hitting a variety of issues like stalled-out Postgres autovacuum.
The article suggests a different reason. What would be your approach if you wanted to stay on RDS?

> So, now, let me speculate. The real reason why Figma reinvented the wheel by creating their own custom solution for sharding might be as straightforward as this — Figma wanted to stay on RDS, and since Amazon had decided not to support the CitusData extension in the past, the Figma team had no choice but to develop their own sharding solution from scratch.

This rings a lot more true for me as well: a lot of the overly complicated decisions I've made haven't been because I wanted to try something interesting out (although occasionally it's been a factor), but more because I've ended up backed into a corner by previous decisions, factors outside my control, and limited time. Even when the simpler solution is obvious (which isn't always the case), it often takes a more complicated journey to get there. And balancing short term vs long term complexity is a challenge in its own right.
Wanting to stay on RDS is a reason that doesn't survive the sort of extra scrutiny that I said should be applied in situations where you're doing a lot of work towards an internal goal. It also says in the article that they thought it was too risky to migrate (but somehow building their own sharding solution is going to be less risky for some reason).

I could of course be wrong but it really just feels to me like the reasons given in the article are attempts to justify a decision that was actually made because of "not invented here" syndrome.

Looks like you can’t think of a good reason to stay on RDS in this case, is that correct?
I can totally see why they want to stay on RDS, but think the other considerations should almost certainly outweigh that.

My main point is this decision makes no sense on its face[1]. Obviously I'm lacking the real context, so there may be overwhelming circumstances which mean that it was the right decision anyway, but these weren't explained in TFA for me. In TFA the reasoning was superficial, and this is the sort of decision that really should be held to a very high standard because as I say these types of internal goals have the potential to burn a ton of valuable engineering time on things which don't affect the customer-facing offering.

Now we have in a sibling thread someone from Notion saying they did the same thing, and for me exactly the same reasoning applies. It could be that all these different SaaS companies are so special that each of them building its own individual Postgres sharding solution, to work around the fact that they can't get a sharded, managed Postgres instance, makes sense. Or not.

[1] That's what I mean by saying it doesn't pass the sniff test. It might actually be the right decision but your instincts should rebel against it because it feels very wrong. So there needs to be a serious examination before going down that path.

Fair. But it doesn't really explain why they wanted to stay on RDS. This is their reasoning:

> over the past few years, we’ve developed a lot of expertise on how to reliably and performantly run RDS Postgres in-house. While migrating, we would have had to rebuild our domain expertise from scratch.

So they had in-house expertise to run performantly on RDS, but that same experience couldn't be translated to running on EC2 + Citus? Instead they went with building their own sharding, something they had no prior experience with? That left me scratching my head.

I was puzzled by this as well. RDS is a managed, cloud product. You don't run it. The whole point is that AWS runs it for you, no?
It’s Postgres, large dbs will need some level of config and maintenance.

> Common DBA tasks for Amazon RDS for PostgreSQL

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Appen...

Perhaps there were legal, compliance, or contractual constraints that made moving out of RDS impossible within their acceptable business risk levels?
I suppose Figma might just be beyond the "let's find the fastest/cheapest way to get this working" point. I believe it makes sense for a company in that stage to mess about a bit, find different (maaaybe even better) ways of doing things, keep the engineering work interesting to attract/retain talent, be OK with the inevitable waste involved in that game. If you're chasing the global maximum, you shouldn't get too obsessed with local maxima.

That said, I've seen plenty of unprofitable startups with high burn rate play this game. That seems a bit suicidal to me.

> I suppose Figma might just be beyond the "let's find the fastest/cheapest way to get this working" point.

The article implies otherwise. E.g. it quotes Figma saying: “Given our very aggressive growth rate, we had only months of runway remaining.”

Right, I was thinking of Figma in 2024 for some reason (they seem to have conquered the market, I'll just assume they're profitable with that pricing they have), the article talks about Figma in 2022, from what I gather. Should have read properly.
I don't have a read on this - do we know that Figma isn't doing difficult stuff that warrants proprietary solutions?
Absolutely not. Citus would have solved this problem. Or move to MySQL and use PlanetScale etc.

Second best option is ability to easily create prod environments, and then give those to your biggest customers (bigname.figma.com) etc. No single figma customer will go beyond an i3.metal for the DB, or the app.

So I just read the article - they were on RDS so Citus wasn't an option.

They also stated it was too risky to migrate data stores on the timeline they were working within

Those all seem like measured engineering decisions AFAICT

That doesn't sound right.

Do a data dump from prod for the initial sync and then set up replication from RDS to the new cluster. Once synced, do the switch. Then you're off RDS and can shard on Citus.
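
The "once synced, do the switch" step is where most of the care goes. A rough sketch of the gate you would want before flipping traffic is below. The field names and the zero-lag threshold are hypothetical, and how you actually measure lag depends on the replication method you pick.

```python
from dataclasses import dataclass


@dataclass
class ReplicaStatus:
    # All fields are hypothetical; a real check would read them from monitoring.
    initial_sync_done: bool  # the dump has been restored on the new cluster
    lag_bytes: int           # how far the new cluster trails the old primary
    writes_paused: bool      # application writes to the old primary are stopped


def ready_to_switch(status: ReplicaStatus, max_lag_bytes: int = 0) -> bool:
    """True only when the new cluster has caught up with a quiesced primary."""
    return (
        status.initial_sync_done
        and status.writes_paused
        and status.lag_bytes <= max_lag_bytes
    )


print(ready_to_switch(ReplicaStatus(True, 0, True)))
print(ready_to_switch(ReplicaStatus(True, 4096, True)))
```

The check itself is trivial. The hard parts are everything around it: pausing writes without an outage, rehearsing the rollback, and coordinating every service that holds a connection string.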

a) they didn't want to move off RDS b) this is a pretty big hand wave over migrating your persistence store on a moving product and moving engineering teams

The coordination alone usually takes months

Yes, it takes months. They also spent months building this custom solution that they now need to maintain.
Yes but they probably wouldn't want to migrate off RDS.
This doesn't work with the constraint of "staying on AWS" though.
Yes it does.
I do agree here. The choice to prefer X over Y for hosting (no matter X and Y) often makes sense, changing hosting providers can take a bit of time & it's hard to fully assess the reality of the quality of support / security (again, not specifically writing about RDS or Citus, both are very good teams) beforehand, so it usually is safer to have a long probing period to move safely, which takes time, something they visibly didn't have much.
I just read through several thousand lines of code re-implementing the concept of a distributed queue from the ground up... for an application that has maybe a few hundred users. And doesn't need queues, at all.

This issue is so pervasive that we've all just assumed that it must be necessary.

I couldn't get the context of this response - is this application you read unrelated to the featured article?

I just read the article and from what I can tell the Figma team made a somewhat reasonable sounding decision

Yes, unrelated. My point was that wheel-reinvention is a curse of the software industry because it's just so easy to reinvent every wheel on a whim. DevOps is no different. How many large orgs have their own build tooling, or some special sauce around large repos?
I don't read the story this way personally (not saying that these scenarios do not occur, but I feel the narrative detailed in the original article makes sense even without "chasing cool problems").