by Hominem 4373 days ago
I think the way cabs actually operate in NYC makes this practically impossible unless you already have some details such as the lat/lon of dropoff and pickup and time of the stops.

I'm assuming the data is for yellow cabs and the new lime green "boro cabs" you hail on the street, not "car service" cars where you schedule a pickup and dropoff to specific addresses.

Most bars in Manhattan are storefronts in 3-4 story residential buildings. There are apartments above and they are surrounded by other buildings with apartments and businesses. I don't think you could identify a bar. Now strip clubs on the other hand are required by law to be tucked away in isolated locations. Might be possible to identify a strip club.

When you hail a cab, and many times when you get dropped off, it happens on a corner, perhaps over a block away, where it is easier to find a free cab.

Most cabs are in Manhattan, where there are not a lot of single family homes. Single family homes in the outer boroughs will have almost no yellow cab coverage for pickup, and finding a cab that will take you out of Manhattan can be dicey, although I guess those lime green cabs are meant to address that. Pickups in SI, the Bronx and huge swaths of Brooklyn and Queens will be almost nonexistent; people going from the outer boroughs will most likely use a car service.

I will certainly be checking to see if I can identify any of my rides.

4 comments

This. If you've lived in Manhattan for more than a month, you know that pickup and dropoff locations are not precise. Specifically:

1) You never get a cab on quiet single-family condo streets; you've gotta get to the corner of an avenue.

2) Cabbies often click the meter off half a block before you actually say "stop right here please, between the drunken couple and the pile of garbage on the left side". They do this so you pay and get out quicker, clearing the way for another passenger.

3) There are a LOT of "skyscrapers" in Manhattan, with 300+ apts in each.

What WOULD be interesting is taking credit card logs of someone's cab payments and cross-matching dropoff based on charge timestamp :)

Most of the comments here about pickup/dropoff accuracy and large buildings suffer from the same logical flaw: "often" is not the same as "never".

With a comprehensive data set of literally 173 million trips, even if we limit ourselves to precise locations in front of small buildings and residences -- let's say it's a paltry 5% -- that's still more than 8 million trips.

That's more than enough to invade the privacy of a very large number of people.
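To make the arithmetic explicit, here is a quick sketch. The 173 million total and the 5% share are the figures from this comment; the 1% row is an extra, even more pessimistic assumption added purely for illustration.

```python
# Total trips in the data set, as quoted in this thread.
TOTAL_TRIPS = 173_000_000


def trips_at(percent):
    """Number of trips if only `percent` of them have a precise location.

    Integer arithmetic avoids floating-point rounding noise.
    """
    return TOTAL_TRIPS * percent // 100


print(trips_at(5))  # 8650000
print(trips_at(1))  # 1730000 (hypothetical, more pessimistic share)
```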

And that's just the low hanging fruit. With geolocation data you don't always need precise location accuracy or small buildings to see identifying patterns. Don't forget that time is also a very useful factor, and often precise to the minute. E.g. trips departing after 1am within a half-block radius of the only bar in that radius are more likely than not to be patrons. And trips arriving at an apartment building at a particular time may be relatively rare, making it easy to look up the single trip that matches it.
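A rough sketch of the bar filter described above, using only the standard library. Everything specific here is an assumption for illustration: the bar coordinates, the 40 m stand-in for "half a block", the 1am to 5am window, the three sample rows, and the column names (pickup_datetime, pickup_latitude, pickup_longitude), which may not match the real files exactly.

```python
import math
from datetime import datetime

# Hypothetical bar location and thresholds, chosen for illustration only.
BAR_LAT, BAR_LON = 40.7300, -73.9900
RADIUS_M = 40            # assumed stand-in for "half a block"
LATE_START_HOUR = 1      # "after 1am"
LATE_END_HOUR = 5        # assumed end of the late-night window


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6_371_000  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def is_late_night(trip):
    """True if the pickup falls inside the assumed late-night window."""
    t = datetime.strptime(trip["pickup_datetime"], "%Y-%m-%d %H:%M:%S")
    return LATE_START_HOUR <= t.hour < LATE_END_HOUR


def near_bar(trip):
    """True if the pickup point is within RADIUS_M of the bar."""
    d = haversine_m(trip["pickup_latitude"], trip["pickup_longitude"], BAR_LAT, BAR_LON)
    return d <= RADIUS_M


# Made-up sample rows, not real data.
trips = [
    {"pickup_datetime": "2013-03-02 01:47:00", "pickup_latitude": 40.7301, "pickup_longitude": -73.9901},
    {"pickup_datetime": "2013-03-02 14:10:00", "pickup_latitude": 40.7301, "pickup_longitude": -73.9901},
    {"pickup_datetime": "2013-03-02 02:05:00", "pickup_latitude": 40.7400, "pickup_longitude": -73.9900},
]

likely_patrons = [t for t in trips if is_late_night(t) and near_bar(t)]
print(len(likely_patrons))  # 1
```

Only the first row survives: the second is near the bar but in the afternoon, and the third is late at night but more than a kilometer away.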

Thus, a neighbor or roommate who saw someone arrive and noted the time (or had a security camera) might be able to deduce the bar that they visited, address or block of the person they're dating, whether they were actually where they said they were... That's one of a zillion scenarios. Precise address-to-address trips are just the low hanging fruit.

> strip clubs on the other hand are required by law to be tucked away in isolated locations

Really? I used to live across the street from one in Manhattan. It wasn't an illicit club, and I didn't live in some sort of squat house. Dancers and patrons going there would be indistinguishable from people going to my building, apart from the address being off by one.

Behold, a "Gentlemen's Cabaret" club and a porn shop between a falafel place, camera store, burger joint, and residential flats: https://maps.google.com/maps?ll=40.758115,-73.989143&spn=0.0...

> unless you already have some details such as the lat/lon of dropoff and pickup and time of the stops.

These are literally columns in the data set. To quote the original post:

Each file has about 14 million rows, and each row contains medallion, hack license, vendor id, rate code, store and forward flag, pickup date/time, dropoff date/time, passenger count, trip time in seconds, trip distance, and latitude/longitude coordinates for the pickup and dropoff locations. [1]

You're right that a lot of cab pickups/dropoffs happen a few doors down from the actual location, and that there aren't a lot of single family homes in Manhattan. But that doesn't negate what I'm saying. Even if only 20% of rides involve the actual location, that's still an awful lot of potential privacy violations. And even if there are zero single family homes involved, that was only the first scenario of numerous ones I mentioned.

[1] http://chriswhong.com/open-data/foil_nyc_taxi/

What I meant was if you already know the time and location, say you know what time the barista left work and the lat/lon of the coffee shop. It would just be a matter of matching it up with the data in the table to find the drop off.
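A minimal sketch of that lookup, assuming the rows have been loaded as dicts. The known time, the coffee shop coordinates, both tolerances, the sample rows, and the column names are all hypothetical and only there to show the shape of the match.

```python
from datetime import datetime, timedelta

# What the observer already knows (hypothetical values).
KNOWN_TIME = datetime(2013, 3, 2, 18, 5)
KNOWN_LAT, KNOWN_LON = 40.7420, -73.9890

# Assumed tolerances: +/- 5 minutes, and 0.0005 degrees, which is
# roughly 55 m north-south and about 42 m east-west at NYC's latitude.
TIME_TOLERANCE = timedelta(minutes=5)
DEG_TOLERANCE = 0.0005


def matches(trip):
    """True if the pickup is close to the known time and place."""
    t = datetime.strptime(trip["pickup_datetime"], "%Y-%m-%d %H:%M:%S")
    return (
        abs(t - KNOWN_TIME) <= TIME_TOLERANCE
        and abs(trip["pickup_latitude"] - KNOWN_LAT) <= DEG_TOLERANCE
        and abs(trip["pickup_longitude"] - KNOWN_LON) <= DEG_TOLERANCE
    )


# Made-up sample rows, not real data.
trips = [
    {"pickup_datetime": "2013-03-02 18:07:00", "pickup_latitude": 40.7421,
     "pickup_longitude": -73.9891, "dropoff_latitude": 40.6782, "dropoff_longitude": -73.9442},
    {"pickup_datetime": "2013-03-02 18:06:00", "pickup_latitude": 40.7500,
     "pickup_longitude": -73.9890, "dropoff_latitude": 40.7600, "dropoff_longitude": -73.9800},
    {"pickup_datetime": "2013-03-02 19:30:00", "pickup_latitude": 40.7420,
     "pickup_longitude": -73.9890, "dropoff_latitude": 40.7700, "dropoff_longitude": -73.9600},
]

candidates = [t for t in trips if matches(t)]
for t in candidates:
    print(t["dropoff_latitude"], t["dropoff_longitude"])
```

If more than one row comes back, the match is ambiguous; a single candidate is the case where the dropoff is effectively revealed.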

What I don't think is practical is identifying everyone who may or may not have left a particular bar.

> What I don't think is practical is identifying everyone who may or may not have left a particular bar.

This is weaker than your original claim, which was that deducing passenger identities is "practically impossible". You've now conceded that the barista scenario is plausible and left open several others I mentioned.

But let's examine this one, just bars. The basis of your criticism is that some bars are located in residential buildings. First off, this still leaves quite a number of bars that aren't. But even for those that are, the time of day and direction of travel are a pretty fair indicator of bar patrons vs. residents. I.e. trips departing the building after 1am and arriving at a residential location are a lot more likely to be bar patrons than residents.

And don't forget that this public data set is also potentially privacy-violating when combined with other data about the destination, such as information that other residents of that location may know. So even if the general public couldn't determine much from a trip from a gay bar to a home residence one night, a live-in parent could.