by wsprague 6105 days ago

The problem with data.gov and sites like it is that they are built on faulty premises about data:

1. Fiction: Data doesn't require lots of work to make it useable, so we can just upload whatever we have and it will be useful to somebody. Fact: the big useable datasets (Census, IPUMS, NLSY, all the private marketing datasets) have armies of people cleaning and integrating them. It costs money, it takes time, and it is easy to screw up.

2. Fiction: Links are worth something. Fact: links are worthless.

3. Fiction: XML adds value. Fact: tab-delimited ASCII in consistent formats adds value, while XML SUBTRACTS value.

4. Fiction: a good dataset is easy to use. Fact: even a good dataset (google IPUMS for an example) takes a lot of work to get to know how to manipulate, presuming one can use some sort of statistical programming language in the first place.

5. Fiction: simple summaries of common data are useful. Fact: everybody has already done the simple summaries. (This is just a bonus item, and doesn't apply to data.gov, but does apply to faulty thinking about data in general.)

6. Fiction: Federated data is just fine. Fact: Data that is curated, cleaned, and integrated into one big monolithic package is FAR better, because an analyst can then learn the conventions and names and such in one piece, and parallel categories are more likely to align.

7. Fiction: Good data is easy for a layperson to use. Fact: good data still requires a lot of skill. Well, maybe in nations with decent public schools a layperson can do something with data, but not in the US.

What I WOULD like is the following (taken from another post, now deleted):

An ideal data.gov would have a lot of staff who put together a few integrated and curated datasets from the agencies. These would be hierarchies of data in a few formats (shp, txt, raster, SQL text dump, and ...?), along with well-written codebooks and narrative READMEs. They would be distributed using git or subversion. The staff would have the expertise to make such nice data packages for you and me, and they would have the political oomph to demand that the agencies release the data to them. The staff would also give classes on how to use the data, along with some open source statistical packages, to do useful work. Good examples of curated data that I know are IPUMS and the Portland Metro's RLIS (both google-able).
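To make the "tab-delimited text plus codebook" idea concrete, here is a minimal sketch of how such a package might be read. Everything in it is an assumption for illustration: the column names, the codes, and the three-column codebook layout (variable, code, label) are made up and are not taken from IPUMS, RLIS, or data.gov.

```python
import csv
import io

# Hypothetical tab-delimited extract; column names and codes are invented.
DATA_TSV = "person_id\tsex\tstate\n1\t2\t41\n2\t1\t53\n"

# Hypothetical codebook, also tab-delimited: variable, code, label.
CODEBOOK_TSV = (
    "variable\tcode\tlabel\n"
    "sex\t1\tMale\n"
    "sex\t2\tFemale\n"
    "state\t41\tOregon\n"
    "state\t53\tWashington\n"
)


def load_codebook(text):
    """Return {variable: {code: label}} from a tab-delimited codebook."""
    book = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        book.setdefault(row["variable"], {})[row["code"]] = row["label"]
    return book


def decode_rows(data_text, book):
    """Yield data rows with coded values replaced by codebook labels.

    Values with no codebook entry (such as person_id) pass through unchanged.
    """
    for row in csv.DictReader(io.StringIO(data_text), delimiter="\t"):
        yield {
            name: book.get(name, {}).get(value, value)
            for name, value in row.items()
        }


rows = list(decode_rows(DATA_TSV, load_codebook(CODEBOOK_TSV)))
for row in rows:
    print(row)
```

Once the conventions are consistent, the same dozen lines work across every file in the package, which is the advantage of a single curated bundle over a pile of links.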

4 comments

jfager 6104 days ago

I don't understand what you're getting at with this list.

1. Yes, datasets need to be cleaned. But you need to have the dataset before you can clean it, and different people will want to clean it in different ways. Get it up there first, and keep the political debates confined to the gathering methods. Griping about raw datasets only gives them an excuse to keep delaying putting anything out (in other words, this critique is actively harming the movement, please stop making it).

2. I don't understand what you mean by this. If a link points to a high-quality dataset that's otherwise hard to find, then it's very valuable.

3. Not all data is expressible in tab-delimited ascii tables. I'd like my SEC filings in well-structured XML, for instance.

4. This is a strawman. Nobody serious has ever said a good data set is easy to use and understand.

5. Ironically, this is the one point you make I agree with, and then you claim it doesn't apply to data.gov. I think this is actually the worst thing about data.gov right now, that they think they're giving us anything when they post their little summaries. Give us the raw data, please.

6. Isn't this just restating a combination of #1 and #3? Yes, big clean monolithic data sets are nice, but the priority is getting access to the data in the first place.

7. You're restating #4, which was a strawman.
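On #3, a rough sketch of what is at stake when nested data is forced into a tab-delimited table. The XML below is a made-up toy, not a real SEC filing format, and the element and attribute names are assumptions. Flattening works, but every row has to repeat the parent fields, and the nesting survives only as a column convention.

```python
import xml.etree.ElementTree as ET

# Hypothetical nested document; the structure is invented for illustration.
FILING_XML = """
<filing company="Example Corp" period="2008">
  <statement name="income">
    <item name="revenue">100</item>
    <item name="expenses">60</item>
  </statement>
  <statement name="balance">
    <item name="assets">500</item>
  </statement>
</filing>
"""


def flatten(xml_text):
    """Turn the nested filing into flat rows, one per <item>.

    The filing-level and statement-level attributes are copied onto
    every row, since a flat table has nowhere else to put them.
    """
    root = ET.fromstring(xml_text.strip())
    rows = []
    for statement in root.findall("statement"):
        for item in statement.findall("item"):
            rows.append((
                root.get("company"),
                root.get("period"),
                statement.get("name"),
                item.get("name"),
                item.text,
            ))
    return rows


flat_rows = flatten(FILING_XML)
tsv = "\n".join("\t".join(row) for row in flat_rows)
print(tsv)
```

For two levels of nesting this is tolerable. For deeply nested or irregular documents the repeated columns multiply, which is why a structured format can be the better fit for some data.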