Hacker News new | ask | show | jobs
by JohnBooty 963 days ago
I've never done geographic sharding but it seems kind of hard. How do you pick shard boundaries? How do you deal with entities who are near the boundaries and whose current operational data therefore spans >1 shards? (Imagine somebody at near the geographic intersection of like, five shards looking for pizza in a 10 miles radius or w/e)

Also the majority of entities they're tracking (users, drivers) do not have fixed locations.

Maybe it's not as hard as I'm thinking. I guess you just have to accept that any query can span an arbitrary number of shards and the results need to be union'd.

I'm sure a lot of smart people have tackled this at the DoorDashes and Ubers of the world and maybe there's some optimal way of handling it. I would love to hear about that.

3 comments

Great points that show regional databases are not obviously simpler than one global database, especially from the application developer's perspective.

> How do you deal with entities who are near the boundaries and whose current operational data therefore spans >1 shards? (Imagine somebody at near the geographic intersection of like, five shards looking for pizza in a 10 miles radius or w/e)

Hitting 5 shards might not be that bad. I think you could divide the world into sufficiently large hexagonal tiles; you'd hit at most three shards then. Maybe each fixed-size tile is a logically separate database. Some would be much hotter than others, so you don't want to like back each by a fixed-size traditional DBMS or something; that'd be pretty wasteful.

> Also the majority of entities they're tracking (users, drivers) do not have fixed locations.

Yeah, you at least want a global namespace for users with consistent writes. The same email address belonging to different people in different regions is unacceptable. In theory the global data here could just be a rarely-updated breadcrumb pointing to which database holds the "real" data for that user. [1] So you can make the global database be small in terms of data size, rarely-written, mostly read in eventually consistent fashion, and not needed for geographically targeted queries. That could be worthwhile. YMMV.

[1] iirc Spanner and Megastore have a feature called (something like) "global homing" that is somewhat similar to this. For a given entity, the bulk of the data is stored in some subset of the database's replicas, and bread crumbs indicate which. If you get a stale bread crumb, you follow the trail, so looking up bread crumbs with eventually consistent reads is fine. [edit to add a bit more context:] One major use case for this is Gmail. It has tons of regions in total, but replicating each user's data to more than 2 full replicas + 1 witness would be absurdly wasteful.

[edit:] looks like CockroachDB has the concept of a per-row preferred region, which might also be vaguely similar. <https://www.cockroachlabs.com/docs/v23.1/table-localities#re...> I haven't used CockroachDB and only skimmed this doc section.

I think the global homing feature you are describing belongs to applications built atop Megastore (and Spanner). That is, Megastore itself is agnostic to it, but the application knows how to resolve the homing.
You could be right. My memory's fuzzy on this, and I don't see this feature in a casual skim of the Cloud Spanner docs.
Thanks for a thoughtful and informative reply, I learned some things.
> I've never done geographic sharding but it seems kind of hard. How do you pick shard boundaries? How do you deal with entities who are near the boundaries and whose current operational data therefore spans >1 shards? (Imagine somebody at near the geographic intersection of like, five shards looking for pizza in a 10 miles radius or w/e)

You could do it by market (eg. SFBA, Los Angeles, San Diego) or by state.

They would have to have many shards per city to keep up with the level of write traffic though. And what happens when a user from SFBA goes down to LA?
Would they?

I mean, I've seen conventional SQL databases handle ten million orders per hour on a single host. I find it hard to believe DoorDash is processing more than ten million orders per hour, even in a large city.

I suppose they might exceed what a single host can handle if they're, I don't know, recording every driver's location once per second?

    I suppose they might exceed what a single host 
    can handle if they're, I don't know, recording 
    every driver's location once per second?
Even then, that's not that much data. You only need to retain the current location of the driver and you can aggressively prune data more than N seconds old.

A quick Google suggests there are 2M Doordash drivers, but I'll assume that's "all drivers who have ever signed up for DoorDash, ever" and the number of DoorDash drivers actually working at any given moment is a small fraction of that.

If we assume that a max of 100,000 drivers are working at any moment, and a slightly more relaxed location update interval of 10 seconds, that's 10K updates per second which is not exactly super high performance stuff. Of course, tracking driver location is just once piece of their operations.

You could probably just do it by continent as no one from SA would order anything from Europe but at that point each of those databases is probably big enough that you would need a similar solution as if you just had a single one

    You could probably just do it by continent as no 
    one from SA would order anything from Europe
Imagine a luxury version of DoorDash that does work this way. As I awake in my luxury palace in Sao Paulo, I realize that I would like some fresh grapes from the champagne region of France. With a few taps on my Luxury Door Dash app, a plane is on its way with my grapes.