
by jfager 6105 days ago

I don't understand what you're getting at with this list.

1. Yes, datasets need to be cleaned. But you need to have the dataset before you can clean it, and different people will want to clean it in different ways. Get it up there first, and keep the political debates confined to the gathering methods. Griping about raw datasets only gives them an excuse to keep delaying putting anything out (in other words, this critique is actively harming the movement; please stop making it).

2. I don't understand what you mean by this. If a link points to a high-quality dataset that's otherwise hard to find, then it's very valuable.

3. Not all data is expressible in tab-delimited ASCII tables. I'd like my SEC filings in well-structured XML, for instance.

4. This is a strawman. Nobody serious has ever said a good data set is easy to use and understand.

5. Ironically, this is the one point you make that I agree with, and then you claim it doesn't apply to data.gov. I think this is actually the worst thing about data.gov right now: that they think they're giving us anything when they post their little summaries. Give us the raw data, please.

6. Isn't this just restating a combination of #1 and #3? Yes, big clean monolithic data sets are nice, but the priority is getting access to the data in the first place.

7. You're restating #4, which was a strawman.

2 comments

elblanco 6103 days ago

Well-structured XML is almost impossible to beat as a data interchange format (since it was designed for that). If you can't load XML, a format that's been around since the 1990s, you are using the wrong tools.
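
For illustration only, here is a minimal sketch of what "loading XML" can look like with nothing but Python's standard library. The filing below and its element and attribute names are invented for the example (this is not a real SEC schema); the point is just that nested sections, which don't flatten neatly into one tab-delimited table, parse in a few lines.

```python
import xml.etree.ElementTree as ET

# Hypothetical filing: element and attribute names are made up for illustration.
SAMPLE = """
<filing company="Example Corp" year="2008">
  <section name="revenue">
    <item label="products">120</item>
    <item label="services">80</item>
  </section>
  <section name="expenses">
    <item label="salaries">90</item>
  </section>
</filing>
"""


def section_totals(xml_text):
    """Return a dict mapping each section name to the sum of its item values."""
    root = ET.fromstring(xml_text.strip())
    totals = {}
    for section in root.findall("section"):
        # Each section holds a variable number of items, which is the part
        # that is awkward to express as a single flat table row.
        values = [int(item.text) for item in section.findall("item")]
        totals[section.get("name")] = sum(values)
    return totals


if __name__ == "__main__":
    print(section_totals(SAMPLE))
```

Real filings would be far larger and would need a real schema, but the loading step itself stays about this small.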


wsprague 6104 days ago

OK, we disagree. Except that #4 IS sort of redundant, though I want to make the point that data is almost impossible for a layperson to work with, and still really hard for a practiced analyst.


jfager 6104 days ago

I actually meant it when I said I didn't understand what you were getting at. I initially read it as you saying that there shouldn't be a data.gov at all (because raw data's useless, curated data's expensive and difficult, and simplified data summaries are likely to be misinterpreted by lay people), but that can't be right, so I'm really curious what you were actually trying to say. What would an ideal data.gov look like, to you?


wsprague 6104 days ago

I moved my reply to my top comment and screwed up the reply tree here. So this is for the comment that follows this one:

data.gov is fundamentally flawed, and won't be anything but annoying until it is reworked into something along the lines of what I suggest. Or so I think...
