Hacker News
by _vegp 4250 days ago
So this situation happens in the news world all the time. A company or agency has original databases, Excel sheets, what have you, but they don't consider that publishing them in a "human-readable" format is nearly the same thing as publishing the raw data. Try calling the place for a copy, and they'll hang up on you. But it won't occur to them that a crafty outsider can probably reconstruct the original by scraping.

What's particularly interesting here is guessing the motivation behind publishing. Was the information a trade secret, or did a middle-manager want to show that their team is ahead of the others? Or are these feathers to show the company has the know-how and capability?

In any case, most web-published data isn't initially considered published data by its publishers, who in turn don't think to state any restrictions governing it. That's when we scrape and make use of it, and even if there are restrictions on republishing, you can still perform and claim transformative derivative work.

The fun legalese part is what happens when they discover what you're doing and try to lash out, or interrupt a standing scrape. One time, all it took to unblock access was to show up at a meeting and get yelled at by a police captain for 30 minutes. Our retort started with "In the interest of public safety, ..."

1 comments

> One time, all it took to unblock access was to show up at a meeting and get yelled at by a police captain for 30 minutes. Our retort started with "In the interest of public safety, ..."

I'd like to hear more about your example.

Well, the PD found out that we were scraping and publishing data when a superior asked them about it. They were embarrassed and ambushed. Imagine your boss asking you "hey data guy, when did we start sending data to the paper?"

The data itself was public safety information and there was every reason to publish it. Anyhow, our access got cut off, and when we inquired about it, they set up a meeting at their headquarters instead of providing any answers. That morning, I showed up at their Death Star-looking building with my editor and we spent 30 minutes getting chewed out by guys in uniforms, suits and badges for "incorrect geocoding" and other false information that we were publishing.

We said that yes, there were some errors, but that we made every reasonable attempt to validate the data (see http://pp19dd.com/2009/02/vessels-in-distress/). After the guy running the show vented, he showed us the proper way to geocode and correct errors, during which time I was thinking "uh, why not send us the lat/lng that you're showing us here, instead of berating us?"

The compromise was that they'd add "precinct zone" information to the dataset, and we could proceed so long as we checked whether a geocoded point was within the zone. We promised to check this process with a point-in-polygon algorithm, and the guy was happy as a clam that we took note of his work and gave him respect. After that, he eased up and showed us some of the other cool stuff the PD data guys were working on. For example, they pre-plot escape vectors for burglaries so when cops are dispatched, they first go to where bad guys are likely running to, not where they ran from.
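
For anyone curious what that zone check can look like, here is a minimal sketch of a point-in-polygon test using ray casting. It is not the code we ran, and it assumes things the story doesn't specify: that a precinct zone arrives as a simple list of (lng, lat) vertices, and that the zones are small enough to treat coordinates as flat x/y. The function name and the sample square are made up for illustration, and points exactly on a boundary aren't handled specially.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (lng, lat), treated as planar x/y (assumption)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: count how many polygon edges a ray going right crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]  # wrap around to close the polygon
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            # x-coordinate where this edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # each crossing flips inside/outside
    return inside


# Hypothetical square "zone" and two geocoded points
zone = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
in_zone = point_in_polygon((2.0, 2.0), zone)
out_of_zone = point_in_polygon((5.0, 2.0), zone)
```

A geocoded address that lands outside its reported zone is a candidate for a bad geocode, so it gets flagged for review rather than published as-is.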