Hacker News
by gwern 4356 days ago
At least part of the problem is that he's generating what one might call 'info trash': he's taking highly structured information from databases, and turning it into natural-language prose, a data source of less value since it's less structured.

These prose versions are now going to steadily fall out of sync with the original databases, be much more prominent in Wikipedia and Google, diverge from each other, and be harder to parse or run any complex analysis on. (A database is at least relatively comprehensible; to parse his dumps you have to reverse-engineer the format and hope that no other bots or editors have modified the text much and that he didn't get clever with his format strings.) If at some point one wanted to change something about the presentation, it is no longer a matter of editing one template and having the user-friendly HTML view onto the database update automatically for all viewers; one now has to run a carefully written bot on millions of articles (and since that is beyond semi-automated bots, one needs special permission to run it).
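
To make the template-versus-baked-prose point concrete, here is a minimal Python sketch. Everything in it is assumed for illustration: the records, the field names, and the two templates are hypothetical and are not taken from the bot under discussion. It only shows the general shape of the problem: a view rendered from the database picks up data and presentation changes for free, while stored prose keeps the old value and has to be reverse-engineered with a brittle pattern.

```python
import re
from string import Template

# Hypothetical structured records; names and numbers are made up.
RECORDS = [
    {"name": "Exampleville", "state": "Missouri", "population": 1234},
    {"name": "Sampleton", "state": "Kansas", "population": 567},
]

# Two presentation templates: the original and a later redesign.
V1 = Template("$name is a village in $state. The population was $population.")
V2 = Template("$name ($state) had $population residents.")


def render(record, template):
    """Render one record through a template at display time."""
    return template.substitute(record)


def bake_articles(records, template):
    """Generate prose once and store it, as a content-creating bot would."""
    return {r["name"]: render(r, template) for r in records}


# Reverse-engineering the baked prose requires guessing the format string.
V1_PATTERN = re.compile(
    r"^(?P<name>.+) is a village in (?P<state>.+)\. "
    r"The population was (?P<population>\d+)\.$"
)


def parse_baked(text):
    """Recover fields from baked prose, or None if the text no longer matches."""
    match = V1_PATTERN.match(text)
    return match.groupdict() if match else None


baked = bake_articles(RECORDS, V1)

# The source database is corrected later.
RECORDS[0]["population"] = 1300

live = render(RECORDS[0], V2)   # new data, new presentation, one template edit
stale = baked["Exampleville"]   # still carries the old number and old wording

# A small human edit to the prose is enough to defeat the parser.
edited = stale.replace("a village", "a small village")
```

In this sketch `live` reflects both the corrected population and the redesigned wording, `stale` still says 1234, and `parse_baked(edited)` returns `None` because the text no longer matches the guessed format.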

It would have been better to work on merging databases or exporting them into a structured site, something like Freebase.

5 comments

Sometimes I appreciate what you call "info trash." For example, I assume there is a bot that turns census data into articles for every incorporated community in the US, like this: http://en.wikipedia.org/wiki/Agency,_Missouri.

I still think the article is useful as is, with just the map, data sheet, and demographics, and of course many incorporations have additional human-composed information added.

I could imagine some more structured data source, where the main article redirects to a table and scrolls to the correct spot. I would be fine with that, but as far as I know that concept doesn't exist on Wikipedia.

> I could imagine some more structured data source, where the main article redirects to a table and scrolls to the correct spot. I would be fine with that, but as far as I know that concept doesn't exist on Wikipedia.

The structured data source now exists, but how to present its data is still being worked out. You can add information to it now, since the goal is to collect a bunch of structured information and then incrementally figure out how to display it, either on its own or integrated into Wikipedia articles (bot additions here are also very welcome): https://www.wikidata.org

I believe there's going to be some offloading of some structured Wikipedia information so it pulls from Wikidata in the future, instead of being maintained "manually" in articles. For example the geotags that are currently buried in Wikipedia articles' markup will probably be centralized to Wikidata soonish and just pulled from there to display. And infoboxes may be auto-populated from the Wikidata information as well. Sending people to auto-generated stub articles when a "real" article is missing is an interesting idea that might happen longer-term.
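
A rough Python sketch of that idea, under loud assumptions: the `CENTRAL_FACTS` store, the `ARTICLES` dictionary, and the `infobox` and `view` functions are all hypothetical stand-ins, not Wikidata's or MediaWiki's actual interfaces. It only illustrates pulling infobox fields from one central place at display time and falling back to a generated stub when no human-written article exists.

```python
# Hypothetical central store of structured facts, keyed by page title.
CENTRAL_FACTS = {
    "Exampleville": {"latitude": 39.5, "longitude": -94.5, "population": 1234},
    "Sampleton": {"latitude": 38.0, "longitude": -97.0, "population": 567},
}

# Hypothetical human-written article bodies; Sampleton has none yet.
ARTICLES = {
    "Exampleville": "Exampleville was founded by example settlers.",
}


def infobox(title):
    """Build infobox lines from the central store at display time."""
    facts = CENTRAL_FACTS.get(title, {})
    return "\n".join(f"| {key} = {value}" for key, value in sorted(facts.items()))


def view(title):
    """Show the human article if one exists, otherwise a generated stub."""
    body = ARTICLES.get(title)
    if body is None:
        body = f"(auto-generated stub) No article has been written for {title} yet."
    return infobox(title) + "\n\n" + body
```

Because nothing is copied into the article text, correcting a value in `CENTRAL_FACTS` changes what `view` returns for every page that uses it, with no bot run over the articles themselves.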

Looking at the history of that page,[0] it appears a couple of different bots have worked on it, with human intervention. (I suspect ``Ram-Man'' is an earlier version of ``Rambot'', but I could be wrong.)

I've read pages like that before, and it never once occurred to me that they were anything other than the result of sheer human bloody-mindedness. They're not `exciting', but they're very clearly written in an easily parseable way that doesn't scream ``machine-generated'' to me. If this is indicative, the quality of output of these bots is excellent, and a good use of automation --- let the bots fill out the dry factual stuff, and the humans write the less tangible, non-statistical stuff.

[0]: http://en.wikipedia.org/w/index.php?title=Agency,_Missouri&a...

Ram-Man is a human account. The same person operates the Rambot bot account. You can click on their usernames on the history page to see their user pages, which usually describe these things.

But now you are arguing the "it would be better if you did Y instead of X, so therefore stop doing X" fallacy. It's not like he is pillaging the original databases and leaves them burning. They are still there and his copying of data from them doesn't make them worse. You or anyone else are welcome to spend your free time exporting and merging the databases into Freebase.

> But now you are arguing the "it would be better if you did Y instead of X, so therefore stop doing X" fallacy.

That's not a fallacy, that's just good advice. If this Swedish dude wants praise, he should spend his time doing things which are genuinely good, not dubious and possibly net-negative in the long run.

Some people (a fairly significant number) find it much easier to parse information presented in English sentences as opposed to the presentation forms typical in structured data (often table form, with some kind of filtering).

In many cases, it is very rare for facts like those presented in botanical databases to change: it often means a plant has been recategorized, which is a non-trivial thing to do. It is entirely appropriate for this to be handled manually, given how rare it is.

Your arguments about it being better to work on a structured version present a false dichotomy: it isn't Wikipedia OR a structured version, it is Wikipedia AND structured data that need to be improved.

> It would have been better to work on merging databases or exporting them into a structured site, something like Freebase.

Why is this not appropriate for Wikidata?

I was trying to think of Wikidata's name but a search didn't bring up what I was looking for, so I simply used the database site whose name I could recall (because it's always amused me).

Except that no one uses Freebase.