Hacker News new | ask | show | jobs
by davidroetzel 3545 days ago
The articles mentions the approach to have one schema per tenant but sadly does not mention this again afterwards.

Are there serious issues with this approach I am not aware of?

5 comments

Craig from Citus here. There are a number of issues with one schema per tenant, the biggest one is that at a larger scale Postgres will mostly just not work anymore. Things like pg_dump start to fall over and while this has been improved some, there is still an upper limit somewhere between 1,000 and 10,000 tenants. Further having to then run schema migrations against all of them can be quite painful. There are tools that help on the schema migration front, but what we've seen is that again at scale things start to break. If you'll only ever have 100 customers then by schema can work for you.
Performance, management, SLA are some big issues with any "SQL as a service" project. How can you garuntee 1) disk utilization per tenant 2) CPU utilization per tenant and 3) transaction volume per tenant? Most simply, I don't believe you can, and we concluded it's simply not worth going down this path of shared, multi-tenant database (postgresql not withstanding, any RDBMs would have the same issues).

Our conclusion was that the only way to get the required level of management per tenant, and to support truly massive number of tenants, was to use an inprocess database over https ie SQLite and Apache. But, SQLite has an image problem, it's everywhere, and nowhere. It's built with some fundamentally different decisions than other databases, and isn't traditionally used for web applications.

So that's the course we took, links in my profile for more info.

> use an inprocess database over https ie SQLite and Apache

Why not, but you introduced a whole bunch of new issues. SQLite only supports a single writer at a time. This is a problem if you have a lot of users on the same tenant. This is also a problem when you need to create an index, for example, which is not a background operation.

Indeed there are trade offs, and high write applications are a weak spot of this approach. But for low to medium write applications (ie most applications?), SQLite WAL [1] option performs really well. We implement application level caching via a X-Query-Cache header[2], in that case, you're serving directly from redis. This set up can scale really, really well.

[1] https://www.sqlite.org/wal.html

[2] https://www.lite-engine.com/docs.html#caching

One issue we had: pgdump performance was abysmal for dumping a single schema out of thousands.

We had to write a custom backup script to handle individual backups, taking advantage of the knowledge of our own database architecture (basically: we don't have to read the schema list and figure out relationships between them because we already know that).

If you read pgdump's source code, when doing the actual backup it uses postgres COPY command, so it was easy enough to write our custom exporter.

Also interested in this. I architected my system with a single multi-tenant db. My thinking was that it simplified things initially (less admin / backups / pooling etc) but it would be far easier to split into individual dbs later (than to combine).

We are a b2b product and we've picked up customers around the world. Now it seems like it might make sense to shard geographically. Is that common?

We shard geographically, with a single multi-tenant Postgres schema per geographical area.

You run into corner cases when a customer wants to operate into two distinct geographical areas, so basically you may have to maintain a central repository of tenants and, ultimately, under the hood, your tenant primary keys are not handled via local sequences.

I also have a database with schema per tenant. In the tenant code I set the schema. I like it and would like to see more information as well. One issue, I think, is that it does not have full isolation and no easy sharding. As sad by another commenter, I plan to distribute my schemas on multiple servers when neccessary, like the sharding approach.
My personal, recent experience is that my current startup's business model would be poorly served as a result.

You see I originally thought I was building a SaaS service, but it's actually turned out that my customers needed a two-sided network. A per-tenant schema would've been an painful impediment to making that paradigm shift.