| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jamesrom 2894 days ago

Using a Hilbert Curve for indexing is a pretty standard approach for geospatial queries. Nothing new here...

However, using a Hilbert Curve for sharding doesn't seem like the best approach. You can partition by anything you like, it doesn't have to be arbitrary points along your index. Using 1-dimension to shard 2D data isn't optimal.

For example, construct a heatmap of your 'load score' and shard based on that, in two dimensions. Then use an S2 curve to index inside that shard.

3 comments

mbid 2894 days ago

However, using a Hilbert Curve for sharding doesn't seem like the best approach.

Yes, that's also what I thought. Searching for "same size k-means" yields a simple postprocessing step to even out the clusters produced by the usual k-means algorithm.

EDIT: k-means is adapted directly here: https://elki-project.github.io/tutorial/same-size_k_means

link

n4r9 2893 days ago

In my experience the linked algorithm behaves quite poorly when the dataset and number of desired clusters becomes large. I think the issue is that it only allows for pairwise switching between clusters, and this ends up with a lot of points still being assigned to clearly the wrong cluster. Some day I would like to try more complex neighbourhood searches with it and see if it helps.

link

geophile 2894 days ago

> using a Hilbert Curve for sharding doesn't seem like the best approach. You can partition by anything you like, it doesn't have to be arbitrary points along your index. Using 1-dimension to shard 2D data isn't optimal.

If you want to shard by proximity (items close in space are likely to be in the same shard, then the transform to 1d is the way to go, why wouldn't it be? What is your definition of "optimal"?

Sharding by proximity or by something else depends on the relative frequency of queries by location or something else. If you shard by location, then a query to one location goes (ususally) to one shard. That should scale better. Otherwise, each location-based query goes to each shard.

link

realfun 2894 days ago

Thanks for the suggestion and there are definitely room for optimization.

In the original design we did consider approach similar to the "heatmap approach" you mentioned, it would reduce the shard movement for people who commutes in the city.

Later we figured that simply sharding along on the Hilbert Curve already give what we need, and the shard move is unavoidable, we would explain more details around the shard move in future blog.

link