by mulmen 1563 days ago
Millions of activities is approximately zero activities. That's an inconsequential amount of data.

Even if it is a couple of orders of magnitude larger than I think, geographic partitioning can keep the volume small enough to fit easily in RAM.

It's not the size (though as mdoms explains it's more than you might think); it's the complexity of matching activities to segments at the necessary level of precision. I know what a consequential amount of data looks like, having worked on a storage system that routinely fed multiple petabytes of data to each of many analytics pipelines simultaneously. Geographic partitioning is only what gets you to millions of activities instead of billions. There's still a computational component involved in delivering those answers so quickly, and that's what I'm curious about. There might be some interesting algorithms involved. Your blithe dismissal of any part other than the one you think you understand shows that you either can't understand the whole thing or didn't bother to try.
I doubt the actual leaderboard calculation requires full granular data down to the meter. Segments can be aggregated from the meter level data in advance. I’m sure there are clever choices in creating those aggregates but that seems like the obvious (and admittedly naive) approach.
> Segments can be aggregated from the meter level data in advance.

Not for a segment that doesn't exist. When I create a segment, it goes back over all nearby activities since the dawn of Strava time. That's the part where geographic sharding helps. Then, it has to check each one to see whether it traversed the segment without interruption or deviation - not within one meter but AFAICT within only a few. Does further sharding/tiling help there? Yes, and I'm sure that's part of how it works, but I'm interested in exactly how they apply those techniques and then solve all of the remaining problems - undoubtedly including those that neither you nor I have thought of. As you yourself say, there are probably some clever choices involved, and that's the part I'm curious about. It definitely seems like a "meaty" enough topic to fill out a blog post or two, and the world needs more such IMO.
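To make the "traversed the segment without interruption or deviation" check concrete, here is a minimal sketch of one naive way to do it. This is not Strava's method, which nobody in this thread knows. The function names, the 10 m tolerance, and the flat-plane projection are all assumptions for illustration, and a real matcher would also have to handle GPS noise, sparse sampling, loops, and repeated laps.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def to_local_xy(lat, lon, lat0, lon0):
    """Project (lat, lon) to metres on a flat plane centred on (lat0, lon0).

    Good enough for a segment a few kilometres long; an assumption, not a
    general-purpose projection.
    """
    x = math.radians(lon - lon0) * EARTH_RADIUS_M * math.cos(math.radians(lat0))
    y = math.radians(lat - lat0) * EARTH_RADIUS_M
    return x, y


def point_to_edge_dist(p, a, b):
    """Distance in metres from point p to the line piece a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def dist_to_polyline(p, line):
    """Distance in metres from point p to the nearest piece of a polyline."""
    return min(point_to_edge_dist(p, a, b) for a, b in zip(line, line[1:]))


def find_effort(activity, segment, tolerance_m=10.0):
    """Return (start_index, end_index) of the first traversal, or None.

    activity and segment are lists of (lat, lon) pairs. A traversal starts
    near the first segment vertex, passes near every later vertex in order,
    and never strays more than tolerance_m from the segment line.
    """
    lat0, lon0 = segment[0]
    seg = [to_local_xy(lat, lon, lat0, lon0) for lat, lon in segment]
    act = [to_local_xy(lat, lon, lat0, lon0) for lat, lon in activity]

    for start, p in enumerate(act):
        if math.hypot(p[0] - seg[0][0], p[1] - seg[0][1]) > tolerance_m:
            continue  # this point is not at the segment start
        next_vertex = 1
        for i in range(start, len(act)):
            if dist_to_polyline(act[i], seg) > tolerance_m:
                break  # deviated from the segment; try a later start point
            while (next_vertex < len(seg) and
                   math.hypot(act[i][0] - seg[next_vertex][0],
                              act[i][1] - seg[next_vertex][1]) <= tolerance_m):
                next_vertex += 1
            if next_vertex == len(seg):
                return start, i
    return None
```

Even this toy version costs roughly (activity points) x (segment pieces) per candidate activity, which is why the sharding, tiling, and indexing choices matter so much once you run it over every nearby activity ever recorded.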

Absolutely, I am also interested in the details. I'm sure it is very interesting and clever. But I expect the meat is in the ingestion, not in the leaderboard calculation. To provide a new leaderboard so quickly the work must have been done up front.
I spent months working on just getting something that could match GPS logs to "segments", let alone something that's fast and scalable. Geo problems are so much harder than people assume.
Incredibly ignorant statement. An activity can have thousands of data points across dozens of metrics. A record of your route to 1 meter or smaller precision for a 5 hour bike ride alone is thousands of data points. That comes along with hundreds or thousands of time series points for heart rate, elevation, cadence, power, respiration, etc. Millions of activities means tracking tens of billions of data points and perhaps an order of magnitude more than that.
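For a sense of scale, the estimate above can be written out as back-of-envelope arithmetic. Every number below is an assumption picked to match the rough magnitudes in the comment (millions of activities, thousands of samples, a handful of streams), not a real Strava figure.

```python
def estimate_points(activities, samples_per_stream, streams):
    """Total stored samples, assuming every activity records every stream."""
    return activities * samples_per_stream * streams


# All four values are illustrative assumptions.
ACTIVITIES = 2_000_000        # "millions of activities"
SAMPLES_PER_STREAM = 5_000    # "thousands of data points" per activity
STREAMS = 5                   # e.g. position, heart rate, elevation, cadence, power
BYTES_PER_VALUE = 8           # one 64-bit number per sample, uncompressed

points = estimate_points(ACTIVITIES, SAMPLES_PER_STREAM, STREAMS)
raw_bytes = points * BYTES_PER_VALUE
print(f"{points:,} data points, about {raw_bytes / 1e9:.0f} GB uncompressed")
# prints: 50,000,000,000 data points, about 400 GB uncompressed
```

Under these assumptions the raw streams come to hundreds of gigabytes, so whether the data "fits in RAM" depends heavily on how much of it a given query actually has to touch.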
There’s no reason all that granular data would be used to calculate a leaderboard. It can be rolled up in advance and for leaderboard purposes most of it can be ignored.

Parallelism along the location, customer and activity dimensions makes this an easily reducible problem.
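A minimal sketch of the "roll up in advance, then reduce" idea, assuming segment matching has already produced effort records of the form (segment_id, athlete_id, elapsed_seconds). The record layout and function names are hypothetical. The point is only that "best time per athlete per segment" is a min, and mins from separate shards merge cleanly.

```python
def best_times(efforts):
    """Per-shard step: best elapsed time for each (segment, athlete) pair."""
    best = {}
    for segment_id, athlete_id, elapsed_s in efforts:
        key = (segment_id, athlete_id)
        if key not in best or elapsed_s < best[key]:
            best[key] = elapsed_s
    return best


def merge(a, b):
    """Combine two partial results; order and grouping of merges do not matter."""
    merged = dict(a)
    for key, elapsed_s in b.items():
        if key not in merged or elapsed_s < merged[key]:
            merged[key] = elapsed_s
    return merged


def leaderboard(best, segment_id):
    """Rank athletes on one segment, fastest first."""
    rows = [(athlete, t) for (seg, athlete), t in best.items() if seg == segment_id]
    return sorted(rows, key=lambda row: row[1])


# Two shards processed independently, then merged.
shard_1 = [("hill", "ann", 300), ("hill", "bob", 280)]
shard_2 = [("hill", "ann", 290), ("flat", "bob", 120)]
combined = merge(best_times(shard_1), best_times(shard_2))
print(leaderboard(combined, "hill"))  # prints: [('bob', 280), ('ann', 290)]
```

This covers only the cheap half of the argument. It presumes the efforts already exist, which is exactly the matching step the other commenters say is the hard part for a newly created segment.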