Hacker News new | ask | show | jobs
by ahachete 9 days ago
I have mentioned this before, but here it goes again:

I'm really happy that there's more options for Postgres sharding and I applaud Pgdog and the team's efforts and energy.

Having said that, this makes it a no-go for me:

> shard_number = hash(data) % num_shards

https://docs.pgdog.dev/features/sharding/basics/#terminology

Most sharding solutions distribute the hash value over linear ranges, that then split across "virtual shards", that are then placed on the physical shards or worker. This allows for shard replacement when needed. For example, Citus works this way, and even adds convenience functions for shard migration (using logical migration) in an automated way. That's all I'd need.

Operationally, it's worlds apart. With modulo distribution the only way to replace data is to reshard everything --something you don't want to do however fast the operation may be.

1 comments

Yeah good callout. We'll add rendezvous soon enough. Until then, being compatible with Postgres partitions has been advantageous -- while we build everything out, people were able to migrate to PgDog for the query routing layer while doing the resharding in Postgres.

Adding a sharding function in our architecture is relatively straightforward. We also support plugins which can control the flow (and direction) for queries, so our users can add their own (and they do!).

TBH I don't think it's that straightforward, I see it more of a notable architectural change. At a very high level, this means:

* Adding a sharding function, as you say.

* Developing an external service for metadata (shard placement) or alternatively have that metadata in one place and replicate (consistently!) to every query router.

* Implementing functions/catalogs for the users to understand the placement and configure/alter it.

* Implementing shard migration / rebalancing capabilities, possibly using Postgres logical replication (plus notable automation).

Here's one idea if you follow this path, something that Citus doesn't have: make the sharding function pluggable and pick one by default which is well-known and available in many languages (e.g. xxhash). If you do so, and guarantee stability of those functions, they could be used externally (applications) to route queries / inserts especially to the appropriate shard. While it makes application more complex, it may allow (combined with access to the metadata service) for faster ingestion paths (this is often known as application assisted sharding), and its not exclusive of the query routers.

Edit: formatting