by notacoward 1550 days ago
The weird thing is that they claim it's to avoid duplicate activities, but they totally know how to recognize duplicates already. Every so often, there's some sort of glitch in one of my activities getting from Garmin to Strava. If I'm feeling impatient, I download the .fit file from Garmin and upload it to Strava myself, and I never get a duplicate that way. Happened for the first time in a while just last week. Clearly, whenever Garmin does send the data, Strava is perfectly capable of recognizing an activity it already has, and it does the right thing. I'm just not buying that excuse.

BTW and a bit OT, I find it very impressive that Strava can retroactively create a leaderboard going back years for a newly created segment, meaning that they must evaluate potentially millions of nearby activities for overlaps, often in just a few minutes. That's a hell of a query. Anybody know of more information on how they do it?


Millions of activities is approximately zero activities. That's an inconsequential amount of data.

Even if it is a couple of orders of magnitude larger than I think, geographic partitioning can keep the volume small enough to easily fit in RAM.

It's not the size (though as mdoms explains it's more than you might think); it's the complexity of matching activities to segments at the necessary level of precision. I know what a consequential amount of data looks like, having worked on a storage system that routinely fed multiple petabytes of data to each of many analytics pipelines simultaneously. Geographic partitioning is only what gets you to millions of activities instead of billions. There's still a computational component involved in delivering those answers so quickly, and that's what I'm curious about. There might be some interesting algorithms involved. Your blithe dismissal of any part other than the one you think you understand shows that you either can't understand the whole thing or didn't bother to try.
I doubt the actual leaderboard calculation requires full granular data down to the meter. Segments can be aggregated from the meter level data in advance. I’m sure there are clever choices in creating those aggregates but that seems like the obvious (and admittedly naive) approach.
> Segments can be aggregated from the meter level data in advance.

Not for a segment that doesn't exist. When I create a segment, it goes back over all nearby activities since the dawn of Strava time. That's the part where geographic sharding helps. Then, it has to check each one to see whether it traversed the segment without interruption or deviation - not within one meter but AFAICT within only a few. Does further sharding/tiling help there? Yes, and I'm sure that's part of how it works, but I'm interested in exactly how they apply those techniques and then solve all of the remaining problems - undoubtedly including those that neither you nor I have thought of. As you yourself say, there are probably some clever choices involved, and that's the part I'm curious about. It definitely seems like a "meaty" enough topic to fill out a blog post or two, and the world needs more such IMO.
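To make the two steps concrete, here is a minimal sketch: a cheap bounding-box prefilter (a stand-in for geographic sharding), then a check that the track passes within a few meters of each segment point, in order. This is not Strava's method, which nobody in this thread knows. The function names, the 5 m tolerance, and the padding value are all assumptions for illustration. It also only checks ordered passage; detecting deviation between segment points would need point-to-line distances, which this sketch leaves out.

```python
import math

EARTH_RADIUS_M = 6_371_000.0

def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def bbox(points):
    """(min_lat, min_lon, max_lat, max_lon) of a list of (lat, lon) points."""
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    return min(lats), min(lons), max(lats), max(lons)

def bbox_contains(outer, inner, pad_deg=1e-4):
    """True if `inner` fits inside `outer` grown by pad_deg (assumed padding)."""
    return (outer[0] - pad_deg <= inner[0] and outer[1] - pad_deg <= inner[1]
            and outer[2] + pad_deg >= inner[2] and outer[3] + pad_deg >= inner[3])

def match_segment(track, segment, tol_m=5.0):
    """Return (start_idx, end_idx) into `track` if it passes within tol_m of
    every segment point in order, else None. tol_m is a hypothetical tolerance."""
    if not track or not segment:
        return None
    # Cheap prefilter: skip activities that cannot possibly cover the segment.
    if not bbox_contains(bbox(track), bbox(segment)):
        return None
    seg_i, start = 0, None
    for i, pt in enumerate(track):
        # One track point may satisfy several closely spaced segment points.
        while seg_i < len(segment) and haversine_m(pt, segment[seg_i]) <= tol_m:
            if seg_i == 0:
                start = i
            seg_i += 1
        if seg_i == len(segment):
            return start, i
    return None

segment = [(0.0, 0.0), (0.0, 0.0001), (0.0, 0.0002)]
track = [(0.0, -0.0001), (0.00001, 0.0), (0.00001, 0.0001),
         (0.00001, 0.0002), (0.0, 0.0003)]
print(match_segment(track, segment))
```

The prefilter is where sharding or tiling pays off: most activities are rejected without computing a single distance. The per-point loop is the expensive part, and it is the part a real system would have to make much smarter.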

Absolutely, I am also interested in the details. I'm sure it is very interesting and clever. But I expect the meat is in the ingestion, not in the leaderboard calculation. To provide a new leaderboard so quickly the work must have been done up front.
I spent months working on just getting something that could match GPS logs to "segments", let alone something that's fast and scalable. Geo problems are so much harder than people assume.
Incredibly ignorant statement. An activity can have thousands of data points across dozens of metrics. A record of your route at 1-meter or finer precision for a 5-hour bike ride is alone thousands of data points, along with hundreds or thousands of time-series points for heart rate, elevation, cadence, power, respiration, etc. Millions of activities means tracking tens of billions of data points, and perhaps an order of magnitude more than that.
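A quick back-of-the-envelope check of that estimate, as a sketch. The 1 Hz sampling rate, the one-hour average duration, and the count of six channels are assumptions chosen for illustration, not figures from Strava.

```python
SAMPLE_HZ = 1             # assumed: one sample per second
AVG_DURATION_S = 3600     # assumed: average activity lasts one hour
CHANNELS = 6              # position, heart rate, elevation, cadence, power, respiration
ACTIVITIES = 1_000_000    # "millions of activities", taking the low end

def data_points(activities, duration_s, sample_hz, channels):
    """Total samples stored across all activities and channels."""
    return activities * duration_s * sample_hz * channels

# Position samples alone for a single 5-hour ride at the assumed rate.
five_hour_ride = data_points(1, 5 * 3600, SAMPLE_HZ, 1)
total = data_points(ACTIVITIES, AVG_DURATION_S, SAMPLE_HZ, CHANNELS)

print(f"5-hour ride, position only: {five_hour_ride:,}")
print(f"1M activities, all channels: {total:,}")
```

Under these assumptions a million activities already lands in the tens of billions of samples, so the order of magnitude claimed above holds up.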
There’s no reason all that granular data would be used to calculate a leaderboard. It can be rolled up in advance and for leaderboard purposes most of it can be ignored.

Parallelism along the location, customer and activity dimensions makes this an easily reducible problem.

I thought they meant duplicate activities on the Apple Health side. If they can't read what the other services are writing to Apple Health then they have no way of avoiding duplicates.
Doesn't that suggest that Apple is failing to detect and coalesce duplicates, like Strava and (AFAIK) other similar services like Garmin Connect or old MapMyRun do? That seems pretty damning, and also not Strava's problem.
It might be substantially easier to detect duplicates coming from the same device (i.e. the Strava case) than duplicates that might have been repackaged by any number of third parties (i.e. the Apple case).
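A rough sketch of why the two cases differ. When both copies carry the same device ID and start timestamp, an exact key is enough. When a third party has repackaged the activity, those fields may be missing or shifted, so you are left with fuzzy matching on start time and duration. Everything here is hypothetical: the `Activity` fields, the 60-second slack, and the 5% duration tolerance are assumptions, not how Strava or Apple Health actually work.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Activity:
    source: str                 # who delivered this copy
    device_id: Optional[str]    # may be lost when a third party repackages
    start_epoch_s: int
    duration_s: int

def exact_key(a):
    """Same-device case: device ID plus start time identifies the recording."""
    return (a.device_id, a.start_epoch_s) if a.device_id is not None else None

def is_fuzzy_duplicate(a, b, start_slack_s=60, duration_slack=0.05):
    """Repackaged case: accept near-identical start time and duration."""
    if abs(a.start_epoch_s - b.start_epoch_s) > start_slack_s:
        return False
    longer = max(a.duration_s, b.duration_s, 1)
    return abs(a.duration_s - b.duration_s) / longer <= duration_slack

def dedupe(activities):
    """Keep the first copy of each activity, in arrival order."""
    kept, seen = [], set()
    for a in activities:
        key = exact_key(a)
        if key is not None and key in seen:
            continue  # cheap, certain match
        if any(is_fuzzy_duplicate(a, k) for k in kept):
            continue  # slower, and can produce false positives
        if key is not None:
            seen.add(key)
        kept.append(a)
    return kept

garmin = Activity("garmin", "dev1", 1000, 3600)
upload = Activity("manual", "dev1", 1000, 3600)
repack = Activity("third-party", None, 1030, 3590)
other = Activity("garmin", "dev1", 90000, 1800)
print(dedupe([garmin, upload, repack, other]))
```

The exact-key path is a set lookup with no false positives. The fuzzy path compares against every kept activity and can wrongly merge two people's workouts that started within the same minute, which is the harder problem in the multi-party case.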