by jandrewrogers 3257 days ago
I've designed systems that do this on continental scales (i.e. hundreds of millions of cell phones simultaneously, in real-time). The devil is in the details, and the details are non-trivial; this is not an "intern and 6 months" job. Mobile telemetry is not nearly as ideal in practice as assumed here and it typically takes a couple years to learn how to handle the numerous peculiar artifacts of that data that will damage the quality of a naive implementation. Reconstructing a model of the population from the cleaned data that approximates the ground truth is surprisingly difficult and requires quite a bit of clever data science and maths.

It takes a lot of work and expertise to build a population model from mobile telemetry that approximately reflects reality. Far fewer people know how to do this well than you might assume by looking at the requirements for a naive implementation. Even most mobile carriers have limited ability.

3 comments

Carrier iQ demonstrated GPS-level tracking of 100 million+ phones nearly 10 years ago.
> the numerous peculiar artifacts of that data that will damage the quality of a naive implementation.

Do you have examples (or a link/reference to something that has such examples) of those types of artifacts?

Have you posted about this at length somewhere? If not, care to elaborate on what it took to design this system?
I have not written about it. Most of the difficulty and complexity, from my perspective, is in the data science and processing required to construct an accurate population model, which requires additional data sources beyond the mobile telemetry. I designed the custom database platforms underneath (easy for me), which supported the online data processing.

It isn't that difficult technically, if you have experts doing it; it just requires far more domain expertise to do correctly than I think people expect. You also need to be willing to write some of your own tooling to deal with the data efficiently and effectively.

A great PoC is fairly doable by an intern.

Increasing the precision tenfold will likely increase the effort a hundredfold or more. Just because it can be made harder and more expensive doesn't mean it has to be.

At the end of the day, a bit of precision doesn't change the nature of an effective planet-scale mass surveillance system.

I'm gonna take a wild guess, and say that the NSA has a monopoly on the talent for this field.
The Snowden leaks confirmed NSA has the ability to conduct co-traveler inference.[0] In other words: finding mobile devices proximate to a targeted mobile device, based on similar vectors. Perhaps even making associations in absence of targeting via patterns in device proximity over time.

It probably gets real interesting when they're trying to distinguish between various modes of transit, such as a city bus, an Uber/Lyft/taxi, and a private vehicle not participating in rideshares. Of those examples, the latter would suggest the highest degree of association.

Pure speculation, but I wouldn't doubt they take a peek at ridesharing data for co-traveler inference purposes. Knowing if a rideshare driver is on or off the clock would be incredibly valuable information in that context.

[0] https://www.washingtonpost.com/apps/g/page/world/how-the-nsa...
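
To make the "devices proximate to a targeted device" idea concrete, here is a minimal sketch in Python. It is purely illustrative and is not anyone's actual method: the event tuple layout, the `co_traveler_scores` name, the 200 m radius and the 5-minute window are all assumptions chosen for the example. It buckets location events into time windows and counts, per device, how many windows it shared with the target while within the radius.

```python
from collections import defaultdict
from math import radians, sin, cos, asin, sqrt


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    earth_radius_m = 6_371_000.0
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))


def co_traveler_scores(events, target, radius_m=200.0, window_s=300):
    """Count, per device, the time windows in which it was near the target.

    events: iterable of (device_id, unix_ts, lat, lon) tuples (assumed layout).
    radius_m and window_s are illustrative thresholds, not tuned values.
    """
    # Group events into fixed-width time windows.
    buckets = defaultdict(list)
    for dev, ts, lat, lon in events:
        buckets[int(ts // window_s)].append((dev, lat, lon))

    scores = defaultdict(int)
    for window in buckets.values():
        target_pts = [(la, lo) for d, la, lo in window if d == target]
        if not target_pts:
            continue  # the target was not seen in this window
        near = set()
        for d, la, lo in window:
            if d == target:
                continue
            if any(haversine_m(la, lo, tla, tlo) <= radius_m
                   for tla, tlo in target_pts):
                near.add(d)
        # Each device scores at most once per window.
        for d in near:
            scores[d] += 1
    return dict(scores)
```

A device that scores in many windows spread over a long trip is a stronger candidate than one seen near the target once. Real telemetry would need the cleaning discussed upthread before a count like this means anything.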

Graph reconstruction from space-time event data goes far beyond the above in terms of capability. You can infer relationships between people that never co-travel, infer that people have been places that are not in the event data, etc., by stitching together large numbers of orthogonal event streams over long periods of time. It is straightforward to distinguish between various modes of transit analytically. The "metadata" that simply indicates an event in space-time is far more valuable analytically than the data because it is possible to reconstruct so much with it that isn't contained within the data per se.

I was doing all of this five years ago; the capability has been around for a while.
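
As a rough illustration of how transit modes can be separated analytically, here is a toy Python sketch. It is an assumption-laden example, not the approach described above: the `trip_features` and `guess_mode` names, the track layout, and every speed threshold are hypothetical. It derives speeds from consecutive fixes for one device, counts moving-to-stopped transitions, and applies crude rules.

```python
from math import radians, sin, cos, asin, sqrt
from statistics import median


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    earth_radius_m = 6_371_000.0
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))


def trip_features(track, stop_speed_mps=0.5):
    """Summarise one device's track.

    track: list of (unix_ts, lat, lon) sorted by time (assumed layout).
    Returns None if no usable pair of fixes exists.
    """
    speeds = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(track, track[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        speeds.append(haversine_m(la0, lo0, la1, lo1) / dt)
    if not speeds:
        return None
    # A "stop" is a segment at or below stop_speed_mps right after a faster one.
    stops = sum(1 for a, b in zip(speeds, speeds[1:])
                if a > stop_speed_mps >= b)
    return {"median_mps": median(speeds), "max_mps": max(speeds), "stops": stops}


def guess_mode(features):
    """Crude rule-based guess; the thresholds are illustrative only."""
    if features is None:
        return "unknown"
    if features["max_mps"] < 3.0:
        return "walking"
    if features["max_mps"] < 9.0:
        return "cycling"
    return "vehicle"
```

Telling a city bus from a private car would need more than this, for example the `stops` count normalised by distance, or matching stops against known bus routes, both of which are left out of the sketch.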

Do you ever worry that this data might be used for bad purposes?
The handful of people I know that are real experts at the data science are all in the private sector.
Mapping is a wide and common industry, just like web or finance.

The NSA only recruits in the USA. It's a fraction of the talent pool of the planet.

It's actually smaller. At minimum you have to be a naturalized or native-born citizen of the USA for many jobs in the US Government. For the NSA, add the requirement of having a current clearance or the time to get one (worst case, years).
The NSA violates the Constitution on a daily basis; I'm sure they have a loophole to get whatever talent they need.
Care to elaborate?
Most of this talent is in the private sector. Primarily around traffic.
> which requires additional data sources beyond the mobile telemetry.

So it really isn't technically difficult. It's just lacking in data?

In your original post, you made it seem like it was extremely difficult. From my perspective, it seems like child's play from a technical standpoint.

> It isn't that difficult technically, if you have experts doing it

But your top comment implied it was extremely technically difficult.

There are some things that are technically very difficult if you have no domain expertise, and that applies to this subject matter. Most people that try to do this without experience fail in practice. It takes a lot of time and effort to become competent at it, but once you figure it out it is repeatable without too much effort.

There is a much smaller set of things that are technically difficult to execute even if you are highly experienced at doing them, where each time is a challenge. This is not one of those cases; it just has a severe learning curve.