

by fatbird 2034 days ago

Not the way Mongo does it.

The way sharding works in general is that your data gets an additional key (unless it already has one that works for sharding), and a router in the stack does traffic control on each query to send it to the correct shard or shards; on the way back, the partial results are reassembled into a single result. This lets a query run in parallel across shards, map-reduce style, which boosts performance. Secondarily, your shard management layer can do a lot with shard-level redundancy and dynamic sharding to spread the data evenly across shards.
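To make the general idea concrete, here's a toy sketch of a router that places documents by hashing a shard key, sends keyed lookups to one shard, and fans everything else out to all shards before merging the partial results. It's purely illustrative: `ShardRouter`, its methods, and the in-memory lists standing in for shards are all made up for this example, and nothing here is how Mongo or any real router is implemented.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


class ShardRouter:
    """Toy router (hypothetical): hash placement, targeted reads, scatter-gather."""

    def __init__(self, num_shards):
        # Each "shard" is an in-memory list standing in for a database node.
        self.shards = [[] for _ in range(num_shards)]

    def shard_for(self, key):
        # Stable hash, so the same key always maps to the same shard.
        return zlib.crc32(str(key).encode("utf-8")) % len(self.shards)

    def insert(self, doc, shard_key="_id"):
        self.shards[self.shard_for(doc[shard_key])].append(doc)

    def find_by_key(self, key, shard_key="_id"):
        # Targeted query: the shard key tells the router which single shard to ask.
        shard = self.shards[self.shard_for(key)]
        return [d for d in shard if d[shard_key] == key]

    def find(self, predicate):
        # Scatter-gather: filter every shard in parallel, then reassemble
        # the partial results into a single list.
        def scan(shard):
            return [d for d in shard if predicate(d)]

        with ThreadPoolExecutor(max_workers=len(self.shards)) as pool:
            partials = list(pool.map(scan, self.shards))
        return [d for part in partials for d in part]
```

A query that includes the shard key touches one shard; anything else has to visit every shard, which is where the parallelism (and the merge step) comes in.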

The scaling axis here is the size of your data: in Mongo's case, the number of documents in a collection. As the number of documents grows, you shard the collection to keep queries on it fast. The sharding router, mongos, only manages sharding by a shard key on a document, so the only sharding possible is spreading documents from a single collection across multiple shards. If you have 10 databases that each have 100 collections, you get all of them on every shard, but on each shard a collection holds only a subset of that collection's docs.
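A rough sketch of what that layout means in practice, assuming range-based placement on the shard key. The split points, the `"key"` field, and the `place` helper are invented for illustration; this is not how Mongo actually computes or stores its ranges.

```python
import bisect

# Hypothetical split points on the shard key: shard 0 owns keys below 1000,
# shard 1 owns [1000, 2000), shard 2 owns [2000, 3000), shard 3 owns the rest.
SPLIT_POINTS = [1000, 2000, 3000]


def shard_for(key):
    # Index of the range that contains this key.
    return bisect.bisect_right(SPLIT_POINTS, key)


def place(docs):
    """docs is an iterable of (namespace, doc) pairs, e.g. ("db1.coll_a", {...})."""
    shards = [{} for _ in range(len(SPLIT_POINTS) + 1)]
    for namespace, doc in docs:
        # Placement depends only on the document's key, never on the namespace,
        # so every database and collection can show up on every shard.
        shards[shard_for(doc["key"])].setdefault(namespace, []).append(doc)
    return shards
```

Feed it documents from several namespaces and each shard ends up with an entry for every namespace, each holding only that shard's slice of the documents.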

I haven't looked deeply into the mechanism, but I imagine this floats on top of mongo replication, where the replication layer cooperates with the sharding manager to replicate only the shard's docs (as Mongo replicates by tailing the oplog, all a shard has to do is ignore oplog entries for docs without a shard key in the correct range).

It's the fact that it works at the collection level that makes it useless for us. Each database, for a single entity, holds a set timespan of timestream data, with one collection per timestream. Entities vary in the number of timestreams/collections they have, but as they're all the same type of entity, their maximum size is pretty consistent and we don't have problems querying a single collection. We don't need collection-level query performance.

Our axis of scaling is the number of databases, not the number of documents within a collection. What would be literally perfect for us is a sharding manager that routes by database. We could put each entity's database on its own mongo instance/cluster, or an instance holding X databases, and scale horizontally almost indefinitely. We're lucky in that per-entity data falls into a clear range of sizes; we're unlucky in that no such router exists for such a common scenario, which is bizarre to me.
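For what it's worth, the routing logic I have in mind is tiny. Below is a minimal sketch, assuming a fixed cap of X databases per instance and least-loaded placement. `DatabaseRouter`, the instance addresses, and the in-memory placement table are all hypothetical; a real version would need to persist the table and actually proxy connections.

```python
class DatabaseRouter:
    """Hypothetical router that pins each database to one instance."""

    def __init__(self, instances, max_dbs_per_instance):
        self.max_dbs = max_dbs_per_instance
        self.placement = {}                        # database name -> instance address
        self.load = {addr: 0 for addr in instances}  # instance address -> database count

    def add_instance(self, addr):
        # Horizontal scaling: a new, empty instance becomes a placement target.
        self.load.setdefault(addr, 0)

    def route(self, db_name):
        # Known database: always the same instance.
        if db_name in self.placement:
            return self.placement[db_name]
        # New database: place it on the least-loaded instance, if any has room.
        addr = min(self.load, key=self.load.get)
        if self.load[addr] >= self.max_dbs:
            raise RuntimeError("all instances are full; add another instance")
        self.placement[db_name] = addr
        self.load[addr] += 1
        return addr
```

Since per-entity databases fall into a clear size range, counting databases per instance is a reasonable stand-in for capacity here; with uneven sizes you'd want to track bytes instead.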