by dkhar 4356 days ago
I'm not very in-tune with Wikipedia's culture (which, I've read, is very nuanced and rigid[1]), but I really don't see why this is a bad thing, given the information in the articles is accurate (and the article gives the impression that glitches are rare).

If nobody else was going to create an article about some species of butterfly, I don't see why adding that information would be harmful to Wikipedia. Does it make Wikipedia harder to read? Harder to search?

I don't think "it's not written by a human" is a valid argument for factual information, and I've never seen any evidence to suggest that it should be one.

EDIT: I found this bot's edit log! https://sv.wikipedia.org/w/index.php?title=Special:Logg/Lsjb...

Here are a few articles randomly picked out of the latest 1000:

https://sv.wikipedia.org/wiki/Urochloa_plantaginea

https://sv.wikipedia.org/wiki/Brachiaria_vittata

https://sv.wikipedia.org/wiki/Eutriana_repens

https://sv.wikipedia.org/wiki/Andropogon_decipiens

After looking at these, I'm beginning to see why there is some backlash. There are literally thousands of articles here that read "X is a species of grass. It got its name from Y and is described in Z catalog." The only people who would need this information are botanists, and they already have their own specialized sources. I'm still not against bot-produced content, but I understand why some people oppose initiatives like this.

[1] http://www.gwern.net/In%20Defense%20Of%20Inclusionism

11 comments

At least part of the problem is that he's generating what one might call 'info trash': he's taking highly structured information from databases, and turning it into natural-language prose, a data source of less value since it's less structured.

These prose versions are now going to steadily fall out of sync with the original databases, be much more prominent in Wikipedia and Google, diverge from each other, and be harder to parse and perform any complex analysis on (a database is at least relatively comprehensible, but to parse his dumps you have to hope that you can reverse-engineer them, that no other bots or editors have modified them much, and that he didn't get clever with his format strings). If at some point one wanted to change something about the presentation, it would no longer be a matter of editing one template and having the user-friendly HTML view onto the database update automatically for all viewers; one would have to run a carefully written bot on millions of articles (and since that is beyond semi-automated bots, you have to have special permission to run it).

It would have been better to work on merging databases or exporting them into a structured site, something like Freebase.
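
To make the parsing concern concrete, here is a minimal Python sketch. The record, the field names, and the template are all hypothetical, made up for illustration; this is not the bot's actual data or format string. It renders one structured record into a sentence, recovers the fields with a regular expression written against that exact template, and then shows the same expression failing once an editor rewords the sentence slightly.

```python
import re

# Hypothetical record and template, for illustration only.
# Neither is taken from the bot or from any real database schema.
record = {
    "name": "Brachiaria plantaginea",
    "describer": "Heinrich Friedrich Link",
    "catalog": "Catalogue of Life",
}

TEMPLATE = (
    "{name} is a species of grass first described by {describer}. "
    "It is listed in the {catalog}."
)

# Reverse-engineered pattern: it only works while the prose still
# matches the template word for word.
PATTERN = re.compile(
    r"^(?P<name>.+?) is a species of grass first described by (?P<describer>.+?)\. "
    r"It is listed in the (?P<catalog>.+?)\.$"
)


def render(rec):
    """Turn a structured record into a prose sentence."""
    return TEMPLATE.format(**rec)


def parse(text):
    """Try to recover the record from prose; return None if the text has drifted."""
    match = PATTERN.match(text)
    return match.groupdict() if match else None


prose = render(record)
round_trip = parse(prose)

# A small human edit to the wording is enough to break the parser.
edited = prose.replace("first described by", "originally described by")
after_edit = parse(edited)
```

In this sketch `round_trip` equals the original record, while `after_edit` is `None`: the facts are still in the sentence, but the structure needed to get them back out is gone.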

Sometimes I appreciate what you call "info trash." For example, I assume there is a bot that turns census data into articles for every incorporated community in the US, like this: http://en.wikipedia.org/wiki/Agency,_Missouri.

I still think the article is useful as is, with just the map, data sheet, and demographics, and of course many incorporations have additional human-composed information added.

I could imagine some more structured data source, where the main article redirects to a table and scrolls to the correct spot. I would be fine with that, but as far as I know that concept doesn't exist on Wikipedia.

> I could imagine some more structured data source, where the main article redirects to a table and scrolls to the correct spot. I would be fine with that, but as far as I know that concept doesn't exist on Wikipedia.

The structured data source now exists, but how to present its data is still being worked out. You can add information to it now, since the goal is to collect a bunch of structured information and then incrementally figure out how to display it, either on its own or integrated into Wikipedia articles (bot additions here are also very welcome): https://www.wikidata.org

I believe there's going to be some offloading of some structured Wikipedia information so it pulls from Wikidata in the future, instead of being maintained "manually" in articles. For example the geotags that are currently buried in Wikipedia articles' markup will probably be centralized to Wikidata soonish and just pulled from there to display. And infoboxes may be auto-populated from the Wikidata information as well. Sending people to auto-generated stub articles when a "real" article is missing is an interesting idea that might happen longer-term.

Looking at the history of that page,[0] it appears a couple of different bots have worked on it, with human intervention. (I suspect ``Ram-Man'' is an earlier version of ``Rambot'', but I could be wrong.)

I've read pages like that before, and it never once occurred to me that they were anything other than the result of sheer human bloody-mindedness. They're not `exciting', but they're very clearly written in an easily parseable way that doesn't scream ``machine-generated'' to me. If this is indicative, the quality of output of these bots is excellent, and a good use of automation --- let the bots fill out the dry factual stuff, and the humans write the less tangible, non-statistical stuff.

[0]: http://en.wikipedia.org/w/index.php?title=Agency,_Missouri&a...

Ram-Man is a human account. The same person operates the Rambot bot account. You can click on their usernames on the history page to see their user pages, which usually describe these things.

But now you are arguing the "it would be better if you did Y instead of X, so therefore stop doing X" fallacy. It's not like he is pillaging the original databases and leaves them burning. They are still there and his copying of data from them doesn't make them worse. You or anyone else are welcome to spend your free time exporting and merging the databases into Freebase.

> But now you are arguing the "it would be better if you did Y instead of X, so therefore stop doing X" fallacy.

That's not a fallacy, that's just good advice. If this Swedish dude wants praise, he should spend his time doing things which are genuinely good, not dubious and possibly net-negative in the long run.

Some people (a fairly significant number) find it much easier to parse information presented in English sentences as opposed to the presentation forms typical in structured data (often table form, with some kind of filtering).

In many cases, it is very rare for facts like those presented in botanical databases to change: it often means a plant has been recategorized, which is a non-trivial thing to do. It is entirely appropriate for this to be handled manually, given how rare it is.

Your arguments about it being better to work on a structured version present a false dichotomy: it isn't Wikipedia OR a structured version, it is Wikipedia AND structured data that need to be improved.

>It would have been better to work on merging databases or exporting them into a structured site, something like Freebase.

Why is this not appropriate for Wikidata?

I was trying to think of Wikidata's name but a search didn't bring up what I was looking for, so I simply used the database site whose name I could recall (because it's always amused me).

Except that no one uses Freebase.

I think the problem is that the entries are very low-quality and are being produced in such great quantities that it will be hard for anyone to turn them into passable articles.

The purpose of Wikipedia is not to be a collection of all the factual information it can gather. You or I might wish for it to be one, but that isn't what its creators mean for it to be. Each individual article is expected to meet Wikipedia's guidelines for relevance and to have some minimal level of quality. If it isn't possible to write a good, encyclopedic article on a topic, Wikipedia's stance is generally that it should be deleted (or, if the article is just overly specific and the information is relevant to a broader topic, the article might be incorporated into a section in a more general article).

I think these stubs have the benefit of reducing the barrier to entry for future contributors. If I search for something and find a stub, I can easily throw in even one sentence, and the article is incrementally improved. Whereas if there's no article, I am much less inclined to make a new one.

Exactly: creating a new Wikipedia page is pretty intimidating because most people do not understand WP's template system or how to make an infobox.

> The only people who would need this information are botanists

I think that looking up plant species is exactly the kind of thing people would want to use an encyclopedia for.

These are (almost) one-sentence articles. They're not appropriate for an encyclopaedia. If anything, they should be in a 'list' article format.

If you want this kind of one-sentence descriptor, it would be better served in a specialist publication.

Why isn't it appropriate for an encyclopaedia?

For example, the 1910 Encyclopaedia Britannica's complete entry for "denim" is "(an abbreviation of the serge de Nimes), the name originally given to a kind of serge. It is now applied to a stout twilled cloth made in various colours, usually of cotton, and used for overalls, &c."

The entry for "Gimli" is "In Scandinavian mythology, the great hall of heaven whither the righteous will go to spend eternity."

It's not hard to find more examples. But I don't think people considered the EB less of an encyclopaedia for its use of one-sentence descriptors.

The heart of the matter is that there's precious little difference between a dictionary and an encyclopedia. Indeed, the EB's full name is "The Encyclopædia Britannica: a Dictionary of Arts, Sciences, Literature and General Information"

To double check that it's not limited to the EB, I looked in Harmsworth's Universal encyclopedia. The entry for "fulcrum" is "(Lat. fulcrum, a prop) Fixed point in the mechanical system of a lever about which the lever can rotate. See Lever." The entry for "gumboil" is "Small abscess on the gum arising in most cases from decay at the root of a tooth."

See http://menvall.wordpress.com/2010/09/14/on-wikipedias-attemp... for an analysis of the distinction between the two, and the conclusion that "everything that is included in a dictionary also can be included in an encyclopedia, whereas all that is included in an encyclopedia either can or can’t be included in a dictionary. This relation is, however, completely misunderstood by some editors of Wikipedia."

(Had you written that it wasn't appropriate for Wikipedia, then that would be a different issue. I speak now only of the broad category of "encyclopedia".)

Triple-checking, the entry for "gumboil" in Wikipedia (at http://en.wikipedia.org/wiki/Gumboil) redirects to "Intraoral dental sinus". The complete entry is two sentences long:

> Intraoral dental sinus (also termed a parulis and commonly, a gumboil) is an oral lesion characterized by a soft erythematous papule (red spot) that develops on the alveolar process in association with a non-vital tooth and accompanying dental abscess.[1] A parulis is made up of inflamed granulation tissue.

By your definition, this "(almost) one-sentence" article should be removed from WP, no?

Firstly, just because you can find brief entries in other encyclopaedias doesn't mean that it's good form.

Secondly, I didn't demand complete excision with the kind of frothy fervor you're implying. I said a better format for these "X is type Y, discovered by Z, listed in Q" entries is to collate them all in a list format. WP has list articles like this aplenty - dense, easily digestible information on similar topics, allowing quick and easy comparison and scanning.

As for your link, one of the bold highlights is "explains subjects in greater detail than a dictionary". Another of the three definitions of 'encyclopaedia' your link provides says "with data on and discussion of each subject identified" (my emphasis). So that's two out of three definitions that quite strongly indicate non-brief articles - your linked article is wrong from its own source material, and hasn't made the case that dictionary-like brevity is suitable for an encyclopaedia.

> By your definition, this "(almost) one-sentence" article should be removed from WP, no?

What, are you trying to 'catch me out' here? Do you think that's a good quality article? It's a stub, it's not what WP wants to encourage, and it's more like a dictionary definition than either "explaining a subject in greater detail" or "discussion of the subject". Yes, I think it's a bad article for any encyclopaedia - it's quite brief, and full of technical jargon. If you didn't already know the specific jargon, it's completely useless as a "general course of instruction" (the etymology argument from your link). And if you do know the jargon, you have a pretty good chance of working it out from the name alone; the article merely confirms the topic if you're unsure, but you don't get any more insight into it.

As 'trick questions' go, this one sucked.

Trick question? I'm showing that my question - "Why isn't it appropriate for an encyclopaedia?" - is meaningful, by giving counter-examples from three encyclopedias. This suggests that your definition is not aligned with how the term is used in practice.

I ask that you clarify your reasoning.

You say my linked-to reference "hasn't made the case that dictionary-like brevity is suitable for an encyclopaedia". The link isn't trying to make that delineation between the two. It's arguing (and I agree) that a dictionary is a type of encyclopedia, not that they are two different things. You mentioned some quotes, in bold. The author later comments on those exact same quotes (with bold translated to italics):

> These definitions show that whereas dictionary is defined by words alone: “reference work that lists words, usually in alphabetical order, and gives their meanings and often other information such as pronunciations, etymologies, and variant spellings“, encyclopedia is defined either as synonymous to dictionary: “the term is often interchanged with the word “dictionary,” as in the present work” or by a larger extension than dictionary: “explains subjects in greater detail than a dictionary”. There is thus no conflict between dictionary and encyclopedia. They are either synonymous or only have different extensions (i.e., encyclopedia including dictionary, but covering a larger set of phenomena).

I checked with the OED, at http://www.oed.com/viewdictionaryentry/Entry/52325 . It concurs, since its definition 1b. for dictionary is (italics mine):

> In extended use: a book of information or reference on any subject in which the entries are arranged alphabetically; an alphabetical encyclopedia

Yes, I'm saying that the article for "gumboil" in WP is not a stub, does not need to be longer than it is, and very much like what WP should support. While I agree with you in that the older print definition of the term is easier to understand than what WP has, that's at most one more line, and more likely solved by rewriting.

BTW, I also looked up Gimlé in WP. That's three sentences long, so a full two sentences longer than the 'Gimli' entry in Encyclopaedia Britannica.

Why must everything require more than a few lines to fit into your concept of an encyclopedia? Certainly Gimlé doesn't fit in a dictionary, so where else would it go?

This idea that dictionaries and encyclopaedias are separable things is entirely within your head - it's an argument you're projecting onto me. I haven't mentioned the word 'dictionary' at all, with the exception of one quote from your own source. You're attributing to me an argument that I'm not making - I couldn't care less whether you call an encyclopaedia a 'dictionary', an 'encyclopaedia', or a 'sauerkraut sandwich' here. I haven't said anywhere "That should be in a dictionary"

I'm talking about the function of an encyclopaedia - which your own link has sources generally requiring non-brief articles. Articles which discuss and expand on a subject. Even the etymology provided is 'general education', which implies more than mere definition of a word.

Yes, super-short articles like 'gumboil' or 'Gimle' should be rolled into larger, more comprehensive articles. There is plenty you could add to gumboil - an image to show one, demographic preponderances, common treatments, common complicating factors, all of which enhance the user's knowledge of the topic. It certainly should be reduced or modified in terms of jargon. As for Gimlé, there's no reason why it can't be rolled into a more comprehensive article on Asgard, Norse Mythology, or whatever. Check out the 'Elysium' article for ways you can expand it to make it a more useful article in its own right.

Another thing that you're missing is that WP and I both view these things as undesirable, but not so undesirable that they should be destroyed as a matter of course. They're just bad articles - and contrary to what you're saying, they're far from complete.

In the case of the 'grasses' links of the OP, these are absolutely terrible articles (the irony being that they're chaff - an appropriately grassy reference). Yes, it's information, but it's very poorly laid out and hard to access or compare. It's the absolute barest information - and far, far from "general education" substantiveness. Cool, Brachiaria plantaginea is a grass, but let's have a look at the entry:

Brachiaria plantaginea [1] is a species of grass which was first described by Heinrich Friedrich Link, and got its current name of Albert Spear Hitchcock. Brachiaria plantaginea included in the genus Brachiaria, and the grass family. [2] [3] No subspecies are listed in the Catalogue of Life. (ta, google translate)

There is barely any information here beyond "It's a grass". What kind of grass? Is it grass like crabgrass? Like asparagus? Like bamboo? What are its characteristics? Where do you find it? Is it peculiar to any animal's diets? How does it propagate? What does it even look like? Does it have defense mechanisms? Does it survive arid climates well? Are humans allergic to it at all? Not to mention that it's self-evident in the name Brachiaria plantaginea that it's in the genus Brachiaria.

It's an awful, very low quality article - regardless of whether or not you think such information belongs in an encyclopaedia, the article quality does not. Do you feel generally educated by that article? Do you feel like the thing that is Brachiaria plantaginea has been sufficiently discussed? Is the article self-contained (ha!) and explained in detail? These three questions are fundamental parts of the definitions of 'encyclopaedia' given by your original link (and which I don't particularly contest - I rather agree with them).

If one-sentence articles are considered a problem, that seems like a reasonable choice for Wikipedia to make, but it would apply to human-written and bot-written articles.

It does apply, regardless of the source of the article. Wikipedia doesn't like stub articles.

Most articles started as stubs. At the moment many language versions of Wikipedia have different opinions about stubs.

E.g. German Wikipedia doesn't allow stubs anymore, and the admins delete and revert more pages every day than new ones are created. It may be a cultural problem, as such admins identify themselves with 'their articles' and don't allow any changes.

The question of course is in how to determine when a short article is a stub. Some short entries are sufficiently complete for the purposes of WP, whose English guidelines say "there are some subjects about which very little can be written."

Consider http://en.wikipedia.org/wiki/Muati

> Muati is an obscure local god in the Sumerian pantheon. He is associated in some texts with the mythical island paradise of Dilmun, and becomes syncretised with Nabu.

That's unlikely to get much longer. For one, the "Dictionary of the Old Testament: Wisdom, Poetry & Writings ..." says "Muati, a god about whom we know very little."

Considering the fact that the rules and culture of Wikipedia seem to want people to write like bots, I don't see any issues with letting a real one do the writing for us!

> The only people who would need this information are botanists, and they already have their own specialized sources.

So, that's a nice benefit for botanists (particularly amateur or student botanists), with only a very minor cost imposed on everyone else (namely, the slight namespace pollution, but that's very unlikely to manifest itself). Sounds alright to me.

Reading the article, I think some of it is more of a "why?" than a "why not?", since the articles are factually correct but are mostly just lists of facts and not really something you couldn't find in any number of other resources.

For people who do serious article writing, I imagine this might be considered as "cheapening" their work. For instance, I imagine some editors also resent the notion that encyclopedia editing is somehow reducible to plugging facts into the right templates. Of course the bot's authors don't really believe they are creating articles as high-quality as good human-edited entries, but the emotional reaction on the part of other editors is at least something I can comprehend.

I'm not really in tune with Wikipedia, so this is mostly conjecture.

edit: reverted edit, added first sentence of 2nd paragraph.

> "and not really something you couldn't find in any number of other resources."

The same should be able to be said about everything on Wikipedia, since Wikipedia is not supposed to have original research and should have a source for everything.

The "Why?" would then be "Because it is better to search for [obscure butterfly] and find a short list of facts than to search for that butterfly and find nothing at all."

Exactly.

A stub also lowers the barrier of entry for new users wanting to add an obscure butterfly they've just tracked down.

> The same should be able to be said about everything on Wikipedia

I don't think so. Well-written encyclopedic entries are in far shorter supply than bare lists of facts.

I guess the fundamental difference of opinion is between those who feel Wikipedia is an encyclopedia, and those who feel it's a dumping ground for human knowledge. Note that I'm not taking sides, just trying to explain the root causes for the difference of opinion.

Also, on a purely technical note, I very much doubt that you couldn't find the information in bot-generated articles anywhere else using a search engine. If that were the case, where are the bots getting the data?

Bots could be concatenating two or three different sets to create one stub per butterfly.

Or bots could be taking something in a weird set of scans, OCRing that, and then putting it in a stub. This would be troubling unless there was a human checking the quality of the OCR.

There is plenty of stuff that is public domain and not online in a useful form.

The problem is that for any obscure organism, you need to know what it is before you can search for it on WP, and there is no way to identify it on WP itself: if the article on the species has an image, chances are good that it shows a related species. Then there is the problem of someone coming along and adding mangled 'facts' to the article, or 'facts' derived from 19th-century works.

> ...Reading the article, I think some of it is more of a "why?" than a "why not?", since the articles are factually correct but are mostly just lists of facts and not really something you couldn't find in any number of other resources.

The argument I'd make for "why?" is that Wikipedia is more accessible and more reliably available than most other resources. I mean, if the government of the Philippines had a web-based, up-to-date list of towns with some basic information, it might make sense to offload the effort of maintaining that information to them. As it stands, though, not even the US has such a directory -- so Wikipedia picks up the slack (or at least it does for towns in the US).

I'm normally a big fan of including any and all accurate information on Wikipedia. However, with this many articles, I'd be concerned about the ability of any human editors to actually notice malicious misinformation. If a random person edited one of those articles to change an obscure fact to an incorrect statement, would anyone notice?

That's a problem Wikipedia has in general; improvements to Wikipedia's botany coverage are not going to meaningfully change it one way or the other.

I'm completely pro-bots but it does seem this one writes articles that are far too short, with very little content. Bots need to do a similar-enough job to humans while being fast and reliable.

Not spam the wiki with names and basically no info.

> There are literally thousands of articles here that read "X is a species of grass. It got its name from Y and is described in Z catalog." The only people who would need this information are botanists, and they already have their own specialized sources.

Except that those articles seem to have good infoboxes. Such structured information is very useful for many purposes. For instance, it is used to build ontologies based on the data on dbpedia/wikidata. These datasets help in constructing better semantic tools (like translators etc.). So it's still pretty useful.

Looking at [1], the following can be found listed under sources:

^ <![CDATA[Scribn. & Merr.]]>, 1901 In: Bull. Div. Agrostol. U.S.D.A. 24: 26

Maybe he should have done more testing before spamming Wikipedia with a mess like that.

[1] https://sv.wikipedia.org/wiki/Eutriana_repens

So where is the deletionist bot to counter his spam?