Who Is “Public” Data For? | HN Mirror


	Who Is “Public” Data For? (lithub.com)
	41 points by zverok 1912 days ago

8 comments

Animats 1909 days ago

If you insist on government data being cleaned up, explained, and made "accessible" in the politically correct sense, far less raw data will be released. What gets released will probably have been scrutinized and "redacted". The good stuff comes from vast raw files used for some internal purpose.

jcims 1909 days ago

Agreed. Access first, invest in the accommodations once the data proves useful. Unless your day job is releasing easily consumed data to the public, this likely isn't your top priority.

teucris 1909 days ago

> "accessible" in the politically correct sense

I don’t know how you could get to this as being the point of this article. It’s calling for basic data cleaning that is much cheaper for the publisher to do than for the consumer. Without it, the data is inaccessible to people who could really use it.

moksly 1912 days ago

Here’s the thing about public data from the perspective of the public sector.

We operate 300+ different systems, some are tiny, some are major. Some are built as “modern” web-fronts with healthy APIs behind them, some are older than you and run on a mainframe behind multiple layers of APIs that have been added over the years as demands changed. Most are built and maintained through public procurement processes, and some have been through several different companies and a myriad of different approaches to how you build software. Some were built by companies that haven’t existed for a decade, designed by people who are retired. Many share data but none of them agree on data models. Even if you have natural enterprise architecture descriptions of what a specific data model for a “person” should look like when it goes in or out of a system, that’s not a guarantee that it’s also what a “person” actually looks like inside the system. Many suppliers make a lucrative “additional content” business out of selling BI modules that translate their data into something our analytics departments can use, and as such, have a vested interest in not making the data too accessible on its own. Many are clever enough to set up contracts that don’t give us access to our own data without buying it as an option, or often, as several options for the various parts of the data. On top of this, almost none of these systems were (or are) designed to have clever data models or good documentation. That costs money, meaning your bid will likely lose the procurement process, but it also makes it easier for another company to “outbid” you in the next procurement process should you win.

Anyway, the point is that our data is a mess, as this article points out. What’s worse, it’s not just a mess, it’s a lot of different messes. We map our public toilets in our GIS systems, and so do the 97 other communes of Denmark, but because we don’t have the same GIS systems, and because we don’t have the same data access agreements, you could easily end up with 98 different data sets, some being “never-updated” XML files, others being JSON, others being txt files and some being REST/SOAP/you-name-it APIs. This article is completely right when it tells you that this sort of sucks. In theory you could use the data to make an app that directed you to the nearest public toilet in Denmark, but good luck with the state of the data.

Here is the biggest issue that Open Data faces. The political decision-making layer doesn’t really care that Ronald Reagan released the GPS data “and look what happened”, not when they are distributing funding and have to decide whether they want to spend money on “good Open Data” or more teachers.

indigochill 1909 days ago

Maybe I'm naive, but this seems to be in a similar class of problem to Linux's support for physical peripherals, and the answer seems to be the same: drivers that convert from a mess of different interfaces into a standard interface one layer up.

And like driver development, it can be done by pretty much anyone. Doesn't have to be the low-budget data providers, nor does it all have to be done by one group, as long as the various groups doing the work can agree on a standard interface.
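
To make the driver analogy concrete, here is a minimal sketch in Python, under stated assumptions. It reuses the public-toilet example from the parent comment. The three source formats, their field names, the `Toilet` record, and the `commune_a`/`commune_b`/`commune_c` identifiers are all hypothetical; no real commune publishes data in exactly these shapes. Each "driver" is a small function that turns one raw format into the shared record type.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Toilet:
    """The shared 'standard interface': one record shape for every source."""
    name: str
    lat: float
    lon: float


def from_json(text: str) -> list[Toilet]:
    # Hypothetical source A: a JSON list of {"title", "latitude", "longitude"}.
    return [Toilet(r["title"], float(r["latitude"]), float(r["longitude"]))
            for r in json.loads(text)]


def from_xml(text: str) -> list[Toilet]:
    # Hypothetical source B: <toilets><toilet name=".." y=".." x=".."/></toilets>.
    root = ET.fromstring(text)
    return [Toilet(e.get("name"), float(e.get("y")), float(e.get("x")))
            for e in root.iter("toilet")]


def from_txt(text: str) -> list[Toilet]:
    # Hypothetical source C: semicolon-separated lines "name;lat;lon", no header.
    rows = csv.reader(io.StringIO(text), delimiter=";")
    return [Toilet(name, float(lat), float(lon))
            for name, lat, lon in (r for r in rows if r)]  # skip blank lines


# Registry of "drivers", keyed by a made-up source identifier.
DRIVERS = {"commune_a": from_json, "commune_b": from_xml, "commune_c": from_txt}


def load_all(sources: dict[str, str]) -> list[Toilet]:
    """Run each raw payload through its driver and merge the results."""
    merged: list[Toilet] = []
    for source_id, payload in sources.items():
        merged.extend(DRIVERS[source_id](payload))
    return merged
```

Anyone can contribute a new driver by adding one function and one registry entry, without touching the others. What the sketch cannot solve is the hard part: getting the groups involved to agree on what `Toilet` should contain.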

zdkl 1909 days ago

This is a good idea, but unless *you* have some political capital and room in the budget to design AND perform AND maintain the process long term... You'll only add another competing "standard" to the mess. https://xkcd.com/927/

atkbrah 1909 days ago

Up until you mentioned Denmark, I was 100% sure you talked about my home country, Finland.

amelius 1909 days ago

Wouldn't most of the data you mentioned eventually become part of projects like OpenStreetMap?

bregma 1909 days ago

The only reason to not make something public is because you're trying to hide something.

Even if it's just that 'your data is a mess'... you want to hide the fact that your data is a mess.

Data is either open, or you're trying to hide something. You're lying about something. Period.

edmundsauto 1909 days ago

This is not true - there are plenty of other reasons.

- it requires more effort than not releasing it
- it requires approvals from other branches that are difficult to receive
- your staff don’t have the expertise (yes, this is a real thing some places!) or budget

I’m sure there are many other reasons too.

jcims 1909 days ago

TFA is arguing that raw, unsupported data that is difficult to consume is neither 'public' nor 'open'.

Regarding hiding data, there can be many good justifications for that. Carelessly sharing data with the public is not particularly safe or wise.

franga2000 1909 days ago

Having tried to explain to paper-pushers why I suggest they release some data they have to the public, it always came down to how much effort and therefore time==money it will take. Throwing some existing spreadsheets onto a download site takes only a few hours (yes, hours, that's how big bureaucracy moves), after which they can move on with their actual work.

Any kind of reshaping of the data needs to be done by a data analyst and those are usually not readily available to work on projects that ultimately give no value to the organisation itself. Creating APIs takes developer time and server hosting/maintenance costs. Writing documentation takes almost as long as developing and if you now also want to provide it in multiple languages...

Accessibility is awesome and should be a hard requirement in any essential system, but anything that's entirely non-essential, that few people care about, and that is done pretty much out of kindness and nothing else (has no ROI) should not be expected to be perfect or even good. There was a story some time ago about some lectures being provided online for free that later had to be taken down because they didn't have captions for the hearing impaired. Don't let open-ish data go the same way.

sien 1909 days ago

The FAIR data idea is intended to help with this:

https://en.wikipedia.org/wiki/FAIR_data

It's a good idea. People are trying to do it. But it's work for someone and often that work isn't directly appreciated.

Starting with not letting the data go to waste in someone's filing cabinet or on a personal hard disk is the biggest step.

The Australian Government also has https://data.gov.au/

Now the organisations that provide all that data just have to figure out a good way to quantitatively show the value in a realistic manner.

throwaway3699 1909 days ago

UK: http://data.gov.uk

They also have a COVID-19 dataset with a full developer's guide: http://coronavirus.data.gov.uk

photonios 1909 days ago

Netherlands has: https://data.overheid.nl/en

timdaub 1909 days ago

It's not exactly "public" data, but I have a project to make data more accessible and transparent: https://rugpullindex.com

blueblisters 1909 days ago

What is this exactly? Is it a price discovery mechanism for datasets?

timdaub 1909 days ago

The price discovery is built by Ocean Protocol. I merely use their APIs to give users an overview of what is happening within their data markets.

billsmithaustin 1909 days ago

Rather than debate what “open data” really means, maybe there need to be more specific terms for the different aspects of openness that the OP describes.

vharuck 1909 days ago

My primary job is aggregating and contextualizing health data. I don't think people know how many ways there are to get data from government sources. In my section, we'll work with anyone in any way. I've mailed printed reports, shipped CDs, emailed CSVs, and sent collections of links to various data sources and their stewards. We've received requests by letter, phone, fax, email, and walk-in. We've stopped publishing physical and bound copies of reports, but will print them out if asked. People without computers have access to everything we publish online. It'll be hard to know what's available, but that's no different than how things were before the internet.

Seriously though, if you want data in a different format or on something they don't publish, call or email the agency and make a data request. If you don't understand the context, ask them. Personally, I love going into the details of how and why we analyze the data. Other people get crowds when talking about their jobs, but nobody wants to hear me gush about R programming or gripe about the latest change in cancer epidemiology standards. I happily spend hours crafting and revising detailed replies to simple questions.

I'll agree with the author that published data shouldn't have arbitrary and inscrutable "metadata." Such as field names meaningless even to other experts (looking at you, US Census Bureau). But it's unreasonable to think non-experts should be able to understand and properly use all data products. Read summarized reports if you want data with context.

>on data.gov’s impact page you’ll find a kind of hall-of-fame list of companies that are “public data success stories”: Kayak, Trulia, Foursquare, LinkedIn, Realtor.com, Zillow, Zocdoc, AccuWeather, Carfax. All of these corporations have, in some fashion, built profit models around public data, often charging for access to the very information that the state touts as “accessible, discoverable, and usable.”

First, consider an alternative to listing companies. Should individuals be listed? How many people would care if there was a rave review from Dr. Jane Doe, who used the data in an article?

Second, the companies charge for the collection, processing, and reporting of data. That's added value. If you want the government to add that value in-house, then push your government to hire somebody to do it. I agree, it'll be the cheapest option for society. Government data fees are small, usually just enough to cover hours worked (bonus: government underpays tech workers). But you need to convince them it's a valuable position that'll help private citizens and themselves (agencies have to make data requests of one another).

djoldman 1909 days ago

I don't comment on grammar as it's not generally in the spirit of assuming good intention and borders on snark. However, from a website called lithub, they should really change the title to:

For whom is public data?