| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by frankfrank13 838 days ago

I actually needed this at work once! We needed to fuzz peoples address in a mapped view for analytics, without revealing PII. It ended up never being shipped, but we needed to fuzz geographic data and the thinking was like:

1. Truncate your lat longs to some arbitrary decimal place (this is very very stupid, you end up with grid lines [1])

2. The above method ^^ but everyone tries basically doing like random angle + random length along angle, which doesn't generate uniform results in a circle as the article mentions[2]. So then you try generating points in a box that encloses your circle, and rejecting anything outside the circle, but that smells bad. So you do some googling and find method 6 listed in the article (Good! and Fast!)

3. Realize that fuzzing points is stupid, what you really want is to view points in aggregate anyways, so you try heat maps

[1]: Ok you always end up with grid lines, but truncating to like 1-6 decimal places produces very obvious grid lines to the human eye

[2]: Try this in pyplot! You'll see right away with ~100 points its not uniform in the way you expect

5 comments

bagels 838 days ago

I needed something related, n roughly evenly distributed points on the surface of a sphere. Ended up using a Fibonacci spiral.

https://extremelearning.com.au/how-to-evenly-distribute-poin...

milleramp 838 days ago

Hey me too, very interesting problem. I also used this method as well. https://www.sciencedirect.com/science/article/abs/pii/S00104...

alanbernstein 838 days ago

Force directed layout is my favorite for this.

bagels 838 days ago

Neat. My n was on the order of thousands, so that would have taken a while, but it could have been cached.

fjkdlsjflkds 838 days ago

...but then the algorithmic complexity goes from O(N) to O(N^2), or at least O(N log N), since points have to interact with each other.

(N denotes the number of points you need to generate)

alanbernstein 838 days ago

I confess I've never used this in a situation where time complexity was at all relevant. For one thing, it is possible that a result only needs to be computed once for each N (and per shape, since this approach works on a wider range of shapes than just spheres).

bee_rider 838 days ago

The rejection method smells pretty good to me, in the sense that it should be pretty obvious to anybody with, like, middle school level math I think (right?).

It might fail for higher dimensions, but lots of programs only run on a 3D sphere of a planet, haha!

contravariant 838 days ago

Fail is an understatement, the ratio between the two volumes is basically (n/2)!(4/pi)^(n/2). Which is also the expected number of tries you'll need. The time doesn't merely grow exponentially it grows faster than exponential.

I don't actually know of any useful algorithms with worse asymptomatic behaviour.

ianferrel 838 days ago

Sure, but for the most-used real-world geometry, n = 2 or 3. So how fast it grows really doesn't matter.

Most maps are 2-dimensional and this is fine.

throwaway81523 838 days ago

When I saw "n" in the title I assumed they wanted something that worked reasonably for large n. Rejection is no good because of the notorious "curse of dimensionality". So my idea was to choose a suitable distribution on the radius, then draw from it and choose the angles at random (not sure what those angles are called). You might have to delete the point at the center for that to work.

Certhas 838 days ago

The article explicitly discusses both. Versions for low dimensions that have favorable properties, and those for high dimensions.

The method you propose is not used in high dimensions in practice as it would involve evaluating order n^2 trigonometric functions and is also far harder to implement than the methods discussed in the article.

ls612 838 days ago

Maybe n=4 if you have a timestamp as another dimension.

alanbernstein 838 days ago

Do your timestamps occupy a hypersphere?

ls612 838 days ago

A 1-dimensional hypersphere is a line segment so sure.

bee_rider 837 days ago

Quibble: I don’t think “fail” is an understatement, it is as strong a word as anything else for something that doesn’t work.

Actually, it is an overstatement I think, or something orthogonal; the algorithm is still correct just wildly inefficient. Fail should be reserved for algorithms that produce the wrong result, not ones that produce the right result very slowly, however slowly.

fjkdlsjflkds 838 days ago

> I don't actually know of any useful algorithms with worse asymptomatic behaviour.

Note: "asymptomatic" does not mean the same as "asymptotic"

bee_rider 837 days ago

Most of my code has bad asymptomatic bugs but they never show up in testing because the tests don’t have data sets big enough to show the symptoms!

zeroonetwothree 838 days ago

I doubt that the number of dimensions is the parameter that grows. It’s going to be constant for 99.99% of cases.

esafak 838 days ago

I'd just have used Uber's hexagonal tiling library. https://www.uber.com/blog/h3/

frankfrank13 837 days ago

This is what I landed on! It actually works very very well for aggregate data.

longhaul 838 days ago

Long time ago, I had the same problem while ray tracing using Monte Carlo techniques. Using Mersenne Twister fixed the clustering and grid like randomization. https://en.m.wikipedia.org/wiki/Mersenne_Twister

chpatrick 838 days ago

What if someone lives in a really remote location so they're the only ones in the heat map cell?

nighthawk454 838 days ago

I should think after defining a uniform distribution of points, you could cluster them as needed to form larger lower-resolution chunks according to population size. Binning everyone in that region to say the central most point. Which could then be adaptive as the populations change.

xyzzy_plugh 838 days ago

This is the correct answer. You need an aggregation threshold that restricts precision when population count is too low. In this case, the cell size needs to increase until there are at least N users.

frankfrank13 837 days ago

Another user here mentioned Uber's h3, which is actually what I used. You end up being able to anonymize over arbitrarily large geographic areas using something like a tile, rather than a point

nick7376182 838 days ago

Assign everyone a random offset, that doesn't change, that is large enough to obscure the address with some reasonable radius but small enough to not drastically skew the heat map at a lower zoom level