| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by mslot 1068 days ago

> What's advantage over having tenant id as distribution column?

Ease of use is definitely the main one. If you're willing to put in the work required to use tenant ID as a distribution column (add it to tables, primary keys, foreign keys, filters, joins), then it's a more scalable approach.

A challenge with sharding by tenant ID is that many applications use a normalized data model, meaning not all tables obviously have a tenant ID column. When you use a schema per tenant on vanilla PostgreSQL, no additional steps are typically required to enable schema-based sharding.

There are some other benefits of schema-based sharding such as custom table definitions, simpler & more versatile access control, and longer term we expect it will be easier to pin a large tenant to a node using schemas, or distribute the tables of ultra-large tenants (by some other dimension). Of course, row-based sharding has other benefits like parallel cross-tenant queries and global DDL.

> Couldn't the microservice case be handled by having distributed tables with no distribution column?

Absolutely. We first implemented the notion of single shard distributed tables with a NULL distribution column, and then built schema-based sharding on top as a convenience layer.

> Can one have a reference schema which can be efficiently used alongside every other schema? Guess that's public schema with create_reference_table/create_distributed_function

Yes, public schema (or other another non-distributed schema) can have reference tables & distributed tables as usual, and tables in distributed schemas can have foreign keys / local joins with reference tables.