by chockablock 4224 days ago
In another deleted post [0] the author talks about using a name-to-gender API to look at ride locations by gender, which implies that these analyses were not done using anonymized data.

[0] https://web.archive.org/web/20140827195715/http://blog.uber....

You have to start with the original data, which is obviously not anonymized. Reducing the full data to [gender, time, origin neighborhood, destination neighborhood] leaves you with a pretty anonymous dataset, and is all that would be required for this analysis.
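
A minimal Python sketch of the reduction described above, for illustration only. Every name here (`RawRide`, `ReducedRide`, `reduce_ride`, the `to_neighborhood` lookup) is hypothetical and not taken from Uber's posts; coarsening the time to the hour is also an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Tuple

LatLon = Tuple[float, float]


@dataclass(frozen=True)
class RawRide:
    """Hypothetical full record, as an internal team might see it."""
    rider_name: str
    gender: str
    pickup_time: datetime
    origin: LatLon
    destination: LatLon


@dataclass(frozen=True)
class ReducedRide:
    """Only the four fields named in the comment; no name, no coordinates."""
    gender: str
    hour: int
    origin_neighborhood: str
    destination_neighborhood: str


def reduce_ride(ride: RawRide, to_neighborhood: Callable[[LatLon], str]) -> ReducedRide:
    # to_neighborhood is an assumed lookup from coordinates to a neighborhood label.
    return ReducedRide(
        gender=ride.gender,
        hour=ride.pickup_time.hour,  # assumption: keep only the hour of day
        origin_neighborhood=to_neighborhood(ride.origin),
        destination_neighborhood=to_neighborhood(ride.destination),
    )
```

Whether the reduced rows are actually anonymous depends on how many riders share each combination of values, which this sketch does not check.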

Internal metrics teams nearly always have access to complete data. The issue is sharing non-anonymized data externally.

I agree it's possible that the name-to-gender mapping was done before the full ride data was handed over to this analyst. (Though just removing real names would still leave a lot to be desired in the anonymizing process).

However, there's no mention in these posts of such safeguards, and subjectively the post reads more like the analyst is just fishing around in the full raw dataset of ride times, start and end locations, and names. To wit:

"What else can we learn? First, we can devise a way to statistically assess whether there are more women or men in a neighborhood than we’d expect. [...] We used Rapleaf’s Name to Gender API to assess the likelihood of a rider’s gender given their name, only accepting a match if the probability was >= 95%."

And in the original post, he categorizes rides as possibly related to a late-night hookup based on whether the destination and departure points for 2 rides are within 0.1 mi of each other.
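
For concreteness, here is a rough Python sketch of the two heuristics just described: accepting a gender label only at >= 95% probability, and treating two ride endpoints as matching when they are within 0.1 mi. This is not the analyst's code. The probability dictionary stands in for whatever a name-to-gender service returns (the real Rapleaf API is not modeled here), the function names are hypothetical, and the great-circle (haversine) distance is an assumed way to measure "within 0.1 mi".

```python
import math
from typing import Dict, Optional, Tuple

LatLon = Tuple[float, float]
EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def accept_gender(probs: Dict[str, float], threshold: float = 0.95) -> Optional[str]:
    """Return the most likely label only if its probability meets the threshold.

    `probs` is a hypothetical stand-in for a name-to-gender lookup result,
    e.g. {"female": 0.97, "male": 0.03}.
    """
    if not probs:
        return None
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return label if p >= threshold else None


def haversine_miles(a: LatLon, b: LatLon) -> float:
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(a[0]), math.radians(b[0])
    dphi = phi2 - phi1
    dlam = math.radians(b[1] - a[1])
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(h)))


def endpoints_match(destination: LatLon, departure: LatLon, radius_miles: float = 0.1) -> bool:
    """True if one ride's destination is within radius_miles of another ride's departure."""
    return haversine_miles(destination, departure) <= radius_miles
```

Note that both functions need the raw inputs (a rider's name, exact coordinates), which is the point being made about what data the analyst was working from.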

>Internal metrics teams nearly always have access to complete data. The issue is sharing non-anonymized data externally.

I disagree pretty strongly with this. Do you think that your average Uber rider would be OK with Uber employees analyzing their ride patterns (with their real names attached) to try to figure out where and when they are having sex? Do you think Uber should allow such access to its employees by policy? (It seems we agree that writing a blog post about it is not a great idea.)

> Do you think that your average Uber rider would be OK with Uber employees analyzing their ride patterns (with their real names attached) to try to figure out where and when they are having sex?

I don't see how this is any different from Google analyzing search data to try to figure out whether I'm pregnant. You could make the argument that "it's an algorithm," but at some point someone had to sit down and build that model.

> Do you think that your average Uber rider would be OK with Uber employees analyzing their ride patterns (with their real names attached) to try to figure out where and when they are having sex?

Sure, as long as Uber isn't broadcasting that information with their name attached. The average person really doesn't care about (or understand the extent of) data analysis (from companies or the government) -- what they care about is public disclosure which may mean personal embarrassment or a lawsuit or other form of inconvenience. People who want to control all their data are hoping for a fantasy world where observations and inferences by third parties are magically made impossible. The reasonable thing to focus lawmaking efforts on is limiting legal forms of disclosure and standardizing safe storage requirements for the raw data -- indeed such laws already exist, with the HIPAA privacy rule perhaps being the best known in the US.

HIPAA's not a great example for you to use, since it does in fact limit access to protected information by employees (under a 'minimum necessary' standard) [0]. You can even serve time in federal prison for a violation without disclosing anything [1].

>People who want to control all their data are hoping for a fantasy world where observations and inferences by third parties are magically made impossible.

I think you are setting up a straw man here. What I suspect the average user expects is for their sensitive personal data to be dealt with in a professional and respectful way, with protections against abuse by rogue employees. There are plenty of companies who deal with private data and understand this well. Potatolicious had a comment on another Uber thread detailing the hoops an Amazon employee has to go through to get access to private customer data [2].

Scrubbing these posts suggests that Uber realizes that they have a real problem, at least at the PR level. I wouldn't be surprised if they are also getting more serious about controls on internal access to ride data.

[0] http://www.hhs.gov/ocr/privacy/hipaa/understanding/covereden...

[1] http://dailybruin.com/2010/05/05/former-ucla-medical-center-...

[2] https://news.ycombinator.com/item?id=8624945

I think HIPAA is a great example precisely because it goes beyond "don't disclose this": it also regulates "safe storage requirements", whose purpose is ultimately to make unwanted disclosure (through breaches, rogue employees, etc.) less likely, at whatever scale. (E.g., my plaintext password for a service shouldn't ever be disclosed to even a single person.) I think we're in agreement about people generally expecting professionalism.

If you think that "pretty anonymous" is a thing, you should probably check out http://33bits.org/

Personally, I think they do it with raw access to the database. https://www.uber.com/legal/usa/privacy under the heading "How do we use the information collected" says nothing about anonymizing rideshare data.