Hacker News new | ask | show | jobs
by frankfrank13 838 days ago
I actually needed this at work once! We needed to fuzz peoples address in a mapped view for analytics, without revealing PII. It ended up never being shipped, but we needed to fuzz geographic data and the thinking was like:

1. Truncate your lat longs to some arbitrary decimal place (this is very very stupid, you end up with grid lines [1])

2. The above method ^^ but everyone tries basically doing like random angle + random length along angle, which doesn't generate uniform results in a circle as the article mentions[2]. So then you try generating points in a box that encloses your circle, and rejecting anything outside the circle, but that smells bad. So you do some googling and find method 6 listed in the article (Good! and Fast!)

3. Realize that fuzzing points is stupid, what you really want is to view points in aggregate anyways, so you try heat maps

[1]: Ok you always end up with grid lines, but truncating to like 1-6 decimal places produces very obvious grid lines to the human eye

[2]: Try this in pyplot! You'll see right away with ~100 points its not uniform in the way you expect

5 comments

I needed something related, n roughly evenly distributed points on the surface of a sphere. Ended up using a Fibonacci spiral.

https://extremelearning.com.au/how-to-evenly-distribute-poin...

Hey me too, very interesting problem. I also used this method as well. https://www.sciencedirect.com/science/article/abs/pii/S00104...
Force directed layout is my favorite for this.
Neat. My n was on the order of thousands, so that would have taken a while, but it could have been cached.
...but then the algorithmic complexity goes from O(N) to O(N^2), or at least O(N log N), since points have to interact with each other.

(N denotes the number of points you need to generate)

I confess I've never used this in a situation where time complexity was at all relevant. For one thing, it is possible that a result only needs to be computed once for each N (and per shape, since this approach works on a wider range of shapes than just spheres).
The rejection method smells pretty good to me, in the sense that it should be pretty obvious to anybody with, like, middle school level math I think (right?).

It might fail for higher dimensions, but lots of programs only run on a 3D sphere of a planet, haha!

Fail is an understatement, the ratio between the two volumes is basically (n/2)!(4/pi)^(n/2). Which is also the expected number of tries you'll need. The time doesn't merely grow exponentially it grows faster than exponential.

I don't actually know of any useful algorithms with worse asymptomatic behaviour.

Sure, but for the most-used real-world geometry, n = 2 or 3. So how fast it grows really doesn't matter.

Most maps are 2-dimensional and this is fine.

When I saw "n" in the title I assumed they wanted something that worked reasonably for large n. Rejection is no good because of the notorious "curse of dimensionality". So my idea was to choose a suitable distribution on the radius, then draw from it and choose the angles at random (not sure what those angles are called). You might have to delete the point at the center for that to work.
The article explicitly discusses both. Versions for low dimensions that have favorable properties, and those for high dimensions.

The method you propose is not used in high dimensions in practice as it would involve evaluating order n^2 trigonometric functions and is also far harder to implement than the methods discussed in the article.

Maybe n=4 if you have a timestamp as another dimension.
Do your timestamps occupy a hypersphere?
A 1-dimensional hypersphere is a line segment so sure.
Quibble: I don’t think “fail” is an understatement, it is as strong a word as anything else for something that doesn’t work.

Actually, it is an overstatement I think, or something orthogonal; the algorithm is still correct just wildly inefficient. Fail should be reserved for algorithms that produce the wrong result, not ones that produce the right result very slowly, however slowly.

> I don't actually know of any useful algorithms with worse asymptomatic behaviour.

Note: "asymptomatic" does not mean the same as "asymptotic"

Most of my code has bad asymptomatic bugs but they never show up in testing because the tests don’t have data sets big enough to show the symptoms!
I doubt that the number of dimensions is the parameter that grows. It’s going to be constant for 99.99% of cases.
I'd just have used Uber's hexagonal tiling library. https://www.uber.com/blog/h3/
This is what I landed on! It actually works very very well for aggregate data.
Long time ago, I had the same problem while ray tracing using Monte Carlo techniques. Using Mersenne Twister fixed the clustering and grid like randomization. https://en.m.wikipedia.org/wiki/Mersenne_Twister
What if someone lives in a really remote location so they're the only ones in the heat map cell?
I should think after defining a uniform distribution of points, you could cluster them as needed to form larger lower-resolution chunks according to population size. Binning everyone in that region to say the central most point. Which could then be adaptive as the populations change.
This is the correct answer. You need an aggregation threshold that restricts precision when population count is too low. In this case, the cell size needs to increase until there are at least N users.
Another user here mentioned Uber's h3, which is actually what I used. You end up being able to anonymize over arbitrarily large geographic areas using something like a tile, rather than a point
Assign everyone a random offset, that doesn't change, that is large enough to obscure the address with some reasonable radius but small enough to not drastically skew the heat map at a lower zoom level