Hacker News
by notacoward 1550 days ago
It's not the size (though as mdoms explains it's more than you might think); it's the complexity of matching activities to segments at the necessary level of precision. I know what a consequential amount of data looks like, having worked on a storage system that routinely fed multiple petabytes of data to each of many analytics pipelines simultaneously. Geographic partitioning is only what gets you to millions of activities instead of billions. There's still a computational component involved in delivering those answers so quickly, and that's what I'm curious about. There might be some interesting algorithms involved. Your blithe dismissal of any part other than the one you think you understand shows that you either can't understand the whole thing or didn't bother to try.
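To make the partitioning step concrete, here is a minimal sketch of a tile-based index that narrows "every activity ever" down to the ones near a segment. It is purely illustrative and not a description of how Strava does it: `TileIndex`, `tile_of`, and the 0.01 degree tile size are all assumptions made up for this example. It only produces candidates; the expensive per-activity matching still has to happen afterwards.

```python
import math
from collections import defaultdict

TILE_DEG = 0.01  # assumed tile size in degrees (roughly 1.1 km of latitude)


def tile_of(lat, lon, tile_deg=TILE_DEG):
    """Map a coordinate to an integer (row, col) tile key."""
    return (math.floor(lat / tile_deg), math.floor(lon / tile_deg))


def neighborhood(tile):
    """The tile plus its 8 neighbours, so points near a tile edge are not missed."""
    row, col = tile
    return {(row + dr, col + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)}


class TileIndex:
    """Hypothetical in-memory index: tile -> ids of activities passing through it."""

    def __init__(self):
        self._tiles = defaultdict(set)

    def add_activity(self, activity_id, points):
        # Register the activity under every tile one of its GPS points falls in.
        for lat, lon in points:
            self._tiles[tile_of(lat, lon)].add(activity_id)

    def near(self, lat, lon):
        # All activities with at least one point in the tile or its neighbours.
        ids = set()
        for tile in neighborhood(tile_of(lat, lon)):
            ids |= self._tiles.get(tile, set())
        return ids

    def candidates(self, segment_points):
        # Cheap pre-filter: the activity must pass near both ends of the segment.
        (s_lat, s_lon), (e_lat, e_lon) = segment_points[0], segment_points[-1]
        return self.near(s_lat, s_lon) & self.near(e_lat, e_lon)
```

A new segment would call `candidates(segment_points)` once and then run the precise check only on the returned ids.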

I doubt the actual leaderboard calculation requires full granular data down to the meter. Segments can be aggregated from the meter level data in advance. I’m sure there are clever choices in creating those aggregates but that seems like the obvious (and admittedly naive) approach.
> Segments can be aggregated from the meter level data in advance.

Not for a segment that doesn't exist. When I create a segment, it goes back over all nearby activities since the dawn of Strava time. That's the part where geographic sharding helps. Then, it has to check each one to see whether it traversed the segment without interruption or deviation - not within one meter but AFAICT within only a few. Does further sharding/tiling help there? Yes, and I'm sure that's part of how it works, but I'm interested in exactly how they apply those techniques and then solve all of the remaining problems - undoubtedly including those that neither you nor I have thought of. As you yourself say, there are probably some clever choices involved, and that's the part I'm curious about. It definitely seems like a "meaty" enough topic to fill out a blog post or two, and the world needs more such IMO.
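For a sense of what the per-activity check might look like, here is a rough sketch of the "traversed without deviation, within a few meters" test. Everything in it is an assumption for illustration, not Strava's method: the function names, the 10 m default tolerance, and the flat-earth (equirectangular) projection, which is only reasonable over short distances. It also ignores timestamps, so it does not detect interruptions such as a paused recording.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def to_xy(lat, lon, lat0, lon0):
    """Project to local planar metres around (lat0, lon0); approximation only."""
    x = math.radians(lon - lon0) * math.cos(math.radians(lat0)) * EARTH_RADIUS_M
    y = math.radians(lat - lat0) * EARTH_RADIUS_M
    return x, y


def dist_to_leg(p, a, b):
    """Distance in metres from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        t = 0.0
    else:
        # Clamp the projection so we measure to the leg, not the infinite line.
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def traverses(activity, segment, tolerance_m=10.0):
    """True if the activity follows the segment from start to end within tolerance."""
    if len(segment) < 2:
        return False
    lat0, lon0 = segment[0]
    seg = [to_xy(lat, lon, lat0, lon0) for lat, lon in segment]
    act = [to_xy(lat, lon, lat0, lon0) for lat, lon in activity]
    last_leg = len(seg) - 2

    for start in range(len(act)):
        # A traversal can only begin at a point close to the segment start.
        if math.dist(act[start], seg[0]) > tolerance_m:
            continue
        leg = 0
        for p in act[start:]:
            # Move forward (never backward) to the next leg once it is at least as close.
            while leg < last_leg and (
                dist_to_leg(p, seg[leg + 1], seg[leg + 2])
                <= dist_to_leg(p, seg[leg], seg[leg + 1])
            ):
                leg += 1
            if dist_to_leg(p, seg[leg], seg[leg + 1]) > tolerance_m:
                break  # deviated from the route; try a later start point
            if leg == last_leg and math.dist(p, seg[-1]) <= tolerance_m:
                return True
    return False
```

Even this toy version is quadratic in the worst case and breaks on routes that cross themselves, which is a small taste of the remaining problems.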

Absolutely, I am also interested in the details. I'm sure it is very interesting and clever. But I expect the meat is in the ingestion, not in the leaderboard calculation. To provide a new leaderboard so quickly the work must have been done up front.
I spent months working on just getting something that could match GPS logs to "segments", let alone something that's fast and scalable. Geo problems are so much harder than people assume.