I wish I had the time or facility to take a snapshot of Wikipedia now, before the imminent deluge of ChatGPT-based updates starts materially modifying Wikipedia in some weird and unpredictable manner.
In late 2021 / early 2022 I got scared about the incoming consequences of LLMs and downloaded all the "Kiwix" archives I could find, including Wikipedia, a bunch of other Wikimedia sites, Stack Overflow, etc.
I'm pretty glad that I did. I'm going to hold onto them indefinitely. They have become the "low background steel" of text.
> In late 2021 / early 2022 I got scared about the incoming consequences of LLMs and downloaded all the "Kiwix" archives I could find, including Wikipedia, a bunch of other Wikimedia sites, Stack Overflow, etc.
> I'm pretty glad that I did. I'm going to hold onto them indefinitely. They have become the "low background steel" of text.
Also, ironically, the Pushshift Reddit dumps (still available via torrent) from before they were taken down. The exact time Reddit shut down the API to sell its data for AI training is also exactly the time that data started to become less valuable for that purpose.
I believe a lot of subreddits started implementing protest moderation policies after Reddit came down on the blackout. IMHO, they should implement rules like "no posts unless it's a ChatGPT hallucination."
Wikipedia has made snapshots available for download for over a decade now, including ones with full edit histories, meaning you can just roll every article back to its last revision before a chosen cutoff date.
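To make the "roll back to a cutoff" idea concrete, here is a minimal Python sketch. It assumes a page from a full-history dump looks roughly like a `<page>` element containing `<revision>` elements, each with a `<timestamp>` and `<text>` child; the sample XML, the function names, and the cutoff date are all made up for illustration, and the real export schema has more fields plus an XML namespace (which the sketch strips).

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def parse_ts(ts):
    # Assumes UTC timestamps shaped like 2021-12-31T23:59:59Z.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def local(tag):
    # Drop an XML namespace prefix if present: "{ns}revision" -> "revision".
    return tag.rsplit("}", 1)[-1]


def text_as_of(page_xml, cutoff):
    """Return the text of the newest revision at or before cutoff, or None."""
    best_ts, best_text = None, None
    for elem in ET.fromstring(page_xml).iter():
        if local(elem.tag) != "revision":
            continue
        fields = {local(child.tag): (child.text or "") for child in elem}
        ts = parse_ts(fields["timestamp"])
        if ts <= cutoff and (best_ts is None or ts > best_ts):
            best_ts, best_text = ts, fields.get("text", "")
    return best_text


# Hypothetical, heavily simplified page history.
SAMPLE_PAGE = """<page>
  <title>Example</title>
  <revision><timestamp>2020-05-01T10:00:00Z</timestamp><text>old text</text></revision>
  <revision><timestamp>2021-11-15T08:30:00Z</timestamp><text>pre-cutoff text</text></revision>
  <revision><timestamp>2023-02-01T12:00:00Z</timestamp><text>post-cutoff text</text></revision>
</page>"""

# Arbitrary cutoff chosen for the example.
CUTOFF = datetime(2022, 11, 30, tzinfo=timezone.utc)

if __name__ == "__main__":
    print(text_as_of(SAMPLE_PAGE, CUTOFF))
```

For a real dump you would not load everything into memory like this; streaming the file page by page (for example with `xml.etree.ElementTree.iterparse`) and applying the same selection per page is probably the more practical route.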
I'm unsure if this will happen. There's plenty of checks-and-balances for Wikipedia edits. There's automated spam detection, editors manually looking over edits for articles on their watchlist, editors who look over subtopics, and even editors that take a look at the general stream of edits. It's already possible to flag mass edits. As for whether ChatGPT will inflect the subtle tone and bias of edits made using it, that's the same as bias from human users. And the same mechanisms for dealing with human bias apply here.
In terms of practical utility, for the vast majority of humanity, access to translated articles in their local language is the biggest problem, I think. There is no Yoruba-language Wiki article on General Relativity, for example. Second come entire biased communities: some of the smaller Wikis are full of far-right editors, and most editors (like 90%) are men.
>> I wish I had the time or facility to take a snapshot of Wikipedia now, before the imminent deluge of ChatGPT-based updates starts materially modifying Wikipedia in some weird and unpredictable manner.
> I'm unsure if this will happen. There's plenty of checks-and-balances for Wikipedia edits.
I think it will. It's so tedious to edit Wikipedia (due to bureaucracy and internal politics) that their editorial population is in a long-term decline, which means their oversight ability is declining too.
Probably what will happen is LLM-generated content will creep into long-tail articles, then work its way into more "medium-profile" articles as editors get exhausted. The extremely high-profile stuff (e.g. New York City), political battleground articles (e.g. Donald Trump), and areas patrolled by obsessives (railroads, Pokemon) will probably remain unaffected by the corruption the longest. At some point, the only way to resist will be to become much more hostile to new editors, but that's also long-term suicide for the project.
I mean, maybe. AI on the "good side" will also improve. It should be possible to check a sentence against its reference with LLMs. And anything not sourced is suspect, just as it is now.
I also don't like the attitude of Wikipedia being "them", as in "their editorial population". It's our public good, like our air, and everyone should care to ensure its high quality. If you see a problem in the world, you have to try to fix it, instead of sitting on the sidelines, looking from the outside in.
The Wikipedia community has generally been pretty resistant to allowing fully AI-based tools in. We've had tools such as Lsjbot (https://en.wikipedia.org/wiki/Lsjbot) in the past, but they've failed to gain community consent on any of the large Wikipedias. If someone tries to bring an LLM-based tool to Wikipedia, it would take a lot of finesse to have any shot of the community allowing it.
Yeah, definitely a potential problem on the smaller language Wikipedias, as the Scots Wikipedia incident shows, but for the big ones, low-quality content from new editors is not really a new problem to deal with.
But what about the whole mass of tech bros who don't understand what LLMs are (random text generators and nothing more) and start adding changes manually? It's a virus polluting every industry.
The Wikipedia politburo already makes it impossible for normies to edit any Wikipedia article worth editing. If you don't believe me, try it out with a stopwatch to see how long it takes for your edit to be reverted.
That you call them a 'politburo' and refer to 'normies' gives an indication that the types of edits you were making were neither well sourced nor neutral.
You can torrent a copy of Wikipedia, including article history. Locally, you can go back to any revision of any article you want. I keep a copy locally just because it seems like something valuable to have.