Hacker News new | ask | show | jobs
by EGreg 2896 days ago
Let me share my experience building location based search at Qbix. All of it is available now for free in the Places plugin:

https://github.com/Qbix/Platform/tree/master/platform/plugin...

First of all, we normally do our sharding by the hash of the primary key. Typically it is the publisherId and hash of the name of a “stream”, which is our general dynamic data structure. What this does is essentially distribute the load evenly between hashes without having to worry about the “water” stuff mentioned above.

Ok now about location based search...

1. We use GeoHash for addressing the surface of the Earth. For lookups, we calculate actual corners of the lat-lng rectangle in Haversine distance and then translate it to geohash.

2. Technically we could have just used a database index at this point and be done with it. But we want to handle both PULL requests and PUSH notifications when something happens nearby.

3. So given a radius the subscriber is interested in, we set up 4 streams for SUBSCRIBERS that together (as rectangles) cover the circle surrounding the location and radius.

4. Then we also set up streams for PUBLISHERS which correspond to all the various drop-down values we have for the radius. This is compicated by the fact that we have one set for metric system and another for imperial. But we just make one stream for each of them. The more radii available the more streams you have to post to (like 10-20) but posting doesn’t happen that often and it’s better to precompute this stuff.

5. Relations between streams are one of the many features that streams do out of the box. So basically the user can just subscribe to a certain location and radius, interest, etc. and get either realtime updates or offline notifications. The item being posted is itself a stream and what’s being inserted is relations between it and the category stream the person is interested in.

Thus you can also see right away who is interested in what. In the future however we plan to encrypt the data end to end so the servers won’t know this info. Then the location search will be harder. (qbix.com/blog)

1 comments

Thanks for sharing EGreg.

Elasticsearch has geo-indexing as well(based on geohash internally), and by default it does id hashing similar to what you said(murmurhash3), we actually leverages that for location based searches.

The challenge addressed in the blog is not in how to search/address(as said Elasticsearch handles it already), it is about how to distribute the load so calculation only happens on limited nodes, and reduce the index size so it can be more performant.

Ah, the goal makes sense. I would suggest that it’s not so bad to have a controller node fan-out and fan-in queries, as long as the database can handle many concurrent queries. Essentially you’re distributing the work evenly across nodes but you don’t have affinity for a particular node. Yes, there is more latency (it is as slow as the slowest connection) but it is endlessly scalable. But, I am sure I missed some benefits from localizing calculations to only a node or two.

In the scheme above, by the way, it DOES localize searches on one shard. Essentially all relations to a stream are on the same shard as the stream. And each center+radius has one associated stream and therefore the search takes place on one shard.