If you insist on government data being cleaned up, explained, and made "accessible" in the politically correct sense, far less raw data will be released. What gets released will probably have been scrutinized and "redacted". The good stuff comes from vast raw files used for some internal purpose.
Agreed. Access first, invest in the accommodations once the data proves useful. Unless your day job is releasing easily consumed data to the public, this likely isn't your top priority.
I don’t know how you could get to this as being the point of this article. It’s calling for basic data-cleaning that is much cheaper for the publisher than for the consumer, and whose absence makes the data inaccessible to people who could really use it.
Here’s the thing about public data from the perspective of the public sector.
We operate 300+ different systems; some are tiny, some are major. Some are built as “modern” web-fronts with healthy APIs behind them; some are older than you and run on a mainframe behind multiple layers of APIs that have been added over the years as demands changed. Most are built and maintained through public procurement processes, and some have been through several different companies and a myriad of different approaches to how you build software. Some were built by companies that haven’t existed for a decade, designed by people who are retired. Many share data but none of them agree on data models. Even if you have natural enterprise architecture descriptions of what a specific data model for a “person” should look like when it goes in or out of a system, that’s not a guarantee that it’s also what a “person” actually looks like inside the system. Many suppliers make a lucrative “additional content” business out of selling BI modules that translate their data into something our analytics departments can use, and as such, have a vested interest in not making the data too accessible on its own. Many are clever enough to set up contracts that don’t give us access to our own data without buying it as an option, or often, as several options for the various parts of the data. On top of this, almost none of these systems were (or are) designed to have clever data models or good documentation. That costs money, meaning your bid will likely lose the procurement process, but it also makes it easier for another company to “outbid” you in the next procurement process should you win.
Anyway, the point is that our data is a mess, as this article points out. What’s worse, it’s not just a mess, it’s a lot of different messes. We map our public toilets in our GIS systems, and so do the 97 other communes of Denmark, but because we don’t have the same GIS systems, and because we don’t have the same data access agreements, you could easily end up with 98 different data sets: some “never-updated” XML files, others JSON, others txt files, and some REST/SOAP/you-name-it APIs. This article is completely right when it tells you that this sort of sucks. In theory you could use the data to make an app that directed you to the nearest public toilet in Denmark, but good luck with the state of the data.
Here is the biggest issue that Open Data faces. The political decision-making layer doesn’t really care that Ronald Reagan released the GPS data “and look what happened”, not when they are distributing funding and have to decide whether they want to spend money on “good Open Data” or more teachers.
Maybe I'm naive, but this seems to be in a similar class of problem to Linux's support for physical peripherals, and the answer seems to be the same: drivers that convert from a mess of different interfaces into a standard interface one layer up.
And like driver development, it can be done by pretty much anyone. Doesn't have to be the low-budget data providers, nor does it all have to be done by one group, as long as the various groups doing the work can agree on a standard interface.
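To make the "driver" idea a bit more concrete, here is a rough Python sketch using the public-toilet example from upthread. Everything in it is an assumption for illustration: the `Toilet` record, the three input layouts, and the field names (`navn`, `lng`, `x`, `y`) are made up, not taken from any real municipal data set. The point is only that each messy source gets its own small converter, and everything above that layer sees one shape.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Toilet:
    """The agreed-upon record every driver must emit (hypothetical schema)."""
    name: str
    lat: float
    lon: float


def from_json(raw: str) -> List[Toilet]:
    # Assumed layout: [{"navn": ..., "lat": ..., "lng": ...}, ...]
    return [Toilet(r["navn"], float(r["lat"]), float(r["lng"]))
            for r in json.loads(raw)]


def from_xml(raw: str) -> List[Toilet]:
    # Assumed layout: <toilets><toilet name="..." y="..." x="..."/></toilets>
    root = ET.fromstring(raw)
    return [Toilet(t.get("name"), float(t.get("y")), float(t.get("x")))
            for t in root.iter("toilet")]


def from_txt(raw: str) -> List[Toilet]:
    # Assumed layout: semicolon-separated "name;lat;lon" lines, no header
    reader = csv.reader(io.StringIO(raw), delimiter=";")
    return [Toilet(row[0], float(row[1]), float(row[2]))
            for row in reader if row]


# One "driver" per source format; anyone can contribute a new entry.
DRIVERS: Dict[str, Callable[[str], List[Toilet]]] = {
    "json": from_json,
    "xml": from_xml,
    "txt": from_txt,
}


def load_all(sources: Iterable[Tuple[str, str]]) -> List[Toilet]:
    """Convert (format, raw_text) pairs into one uniform list of records."""
    records: List[Toilet] = []
    for fmt, raw in sources:
        records.extend(DRIVERS[fmt](raw))
    return records
```

The hard part is not this code, it's getting everyone to agree on what `Toilet` looks like and keeping each converter working when the upstream source changes.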
This is a good idea, but unless *you* have some political capital and room in the budget to design AND perform AND maintain the process long term... You'll only add another competing "standard" to the mess.
https://xkcd.com/927/
This is not true - there are plenty of other reasons.
- it requires more effort than not releasing it
- it requires approvals from other branches that are difficult to receive
- your staff don’t have the expertise (yes, this is a real thing some places!) or budget
Having tried to explain to paper-pushers why I suggest they release some data they have to the public, it always came down to how much effort and therefore time==money it would take. Throwing some existing spreadsheets onto a download site takes only a few hours (yes, hours, that's how big bureaucracy moves), after which they can move on with their actual work.
Any kind of reshaping of the data needs to be done by a data analyst and those are usually not readily available to work on projects that ultimately give no value to the organisation itself. Creating APIs takes developer time and server hosting/maintenance costs. Writing documentation takes almost as long as developing and if you now also want to provide it in multiple languages...
Accessibility is awesome and should be a hard requirement in any essential system, but anything that's entirely non-essential, that few people care about, and that is done pretty much out of kindness and nothing else (has no ROI) should not be expected to be perfect or even good. There was a story some time ago about some lectures being provided online for free that later had to be taken down because they didn't have captions for the hearing impaired. Don't let open-ish data go the same way.
My primary job is aggregating and contextualizing health data. I don't think people know how many ways there are to get data from government sources. In my section, we'll work with anyone in any way. I've mailed printed reports, shipped CDs, emailed CSVs, and sent collections of links to various data sources and their stewards. We've received requests by letter, phone, fax, email, and walk-in. We've stopped publishing physical and bound copies of reports, but will print them out if asked. People without computers have access to everything we publish online. It'll be hard to know what's available, but that's no different than how things were before the internet.
Seriously though, if you want data in a different format or on something they don't publish, call or email the agency and make a data request. If you don't understand the context, ask them. Personally, I love going into the details of how and why we analyze the data. Other people get crowds when talking about their jobs, but nobody wants to hear me gush about R programming or gripe about the latest change in cancer epidemiology standards. I happily spend hours crafting and revising detailed replies to simple questions.
I'll agree with the author that published data shouldn't have arbitrary and inscrutable "metadata", such as field names meaningless even to other experts (looking at you, US Census Bureau). But it's unreasonable to think non-experts should be able to understand and properly use all data products. Read summarized reports if you want data with context.
>on data.gov’s impact page you’ll find a kind of hall-of-fame list of companies that are “public data success stories”: Kayak, Trulia, Foursquare, LinkedIn, Realtor.com, Zillow, Zocdoc, AccuWeather, Carfax. All of these corporations have, in some fashion, built profit models around public data, often charging for access to the very information that the state touts as “accessible, discoverable, and usable.”
First, consider an alternative to listing companies. Should individuals be listed? How many people would care if there was a rave review from Dr. Jane Doe, who used the data in an article?
Second, the companies charge for the collection, processing, and reporting of data. That's added value. If you want the government to add that value in-house, then push your government to hire somebody to do it. I agree, it'll be the cheapest option for society. Government data fees are small, usually just enough to cover hours worked (bonus: government underpays tech workers). But you need to convince them it's a valuable position that'll help private citizens and themselves (agencies have to make data requests of one another).
I don't comment on grammar as it's not generally in the spirit of assuming good intention and borders on snark. However, from a website called lithub, they should really change the title to: