by charlieo88 1075 days ago
I wish I had the time or facility to take a snapshot of Wikipedia now, before the imminent deluge of ChatGPT-based updates starts materially modifying Wikipedia in some weird and unpredictable manner.
12 comments

In late 2021 / early 2022 I got scared about the incoming consequences of LLMs and downloaded all the "Kiwix" archives I could find, including Wikipedia, a bunch of other Wikimedia sites, Stack Overflow, etc.

I'm pretty glad that I did. I'm going to hold onto them indefinitely. They have become the "low background steel" of text.

I really like that analogy.

For anyone curious what low background steel is, it's steel that was made before the first atomic bombs were tested: https://en.m.wikipedia.org/wiki/Low-background_steel

> In late 2021 / early 2022 I got scared about the incoming consequences of LLMs and downloaded all the "Kiwix" archives I could find, including Wikipedia, a bunch of other Wikimedia sites, Stack Overflow, etc.

> I'm pretty glad that I did. I'm going to hold onto them indefinitely. They have become the "low background steel" of text.

Also, ironically, the Pushshift reddit dumps (still available via torrent), before they were taken down. The exact time Reddit shut down the API to sell their data for AI training is also exactly the time it started to become less valuable for that.

I believe a lot of subreddits started implementing protest moderation policies after reddit came down on the blackout. IMHO, they should implement rules like "no posts unless it's a ChatGPT hallucination."

Link to the torrent, for science?
Wikipedia doesn’t remove the old versions.

Otherwise you can find an archive there: https://archive.org/details/wikimediadownloads?and%5B%5D=sub...
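
Since old versions stay in the page history, you can also ask the site for the revision that was current at a given moment. A minimal sketch, assuming the standard MediaWiki Action API (`prop=revisions` with `rvstart` and `rvdir=older`); the helper name, the example title, and the cutoff date are illustrative choices, not anything from this thread. It only builds the query URL and makes no network call:

```python
from urllib.parse import urlencode

# English Wikipedia's Action API endpoint.
API = "https://en.wikipedia.org/w/api.php"


def revision_query_url(title, as_of):
    """Build a query for the newest revision of `title` at or before `as_of`.

    `as_of` is an ISO 8601 UTC timestamp such as "2022-11-30T00:00:00Z".
    The name of this helper is hypothetical.
    """
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        "rvlimit": 1,          # only the single closest revision
        "rvdir": "older",      # walk backwards in time from rvstart
        "rvstart": as_of,
        "rvprop": "ids|timestamp",
    }
    return API + "?" + urlencode(params)


# Example title and cutoff are arbitrary.
url = revision_query_url("Low-background steel", "2022-11-30T00:00:00Z")
print(url)
```

The JSON response should include a revision ID, which can then be viewed as a permanent link through the `oldid` URL parameter.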

Wikipedia has released snapshots available for download for over a decade now including ones with full edit histories, meaning you can just revert all edits to before a chosen epoch.
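
A rough sketch of what "revert to before a chosen epoch" could look like against a full-history dump, assuming the usual MediaWiki XML export layout (`<page>` elements containing `<title>` and `<revision>` elements with `<timestamp>` and `<text>`). The function names, the tiny inline sample, and the cutoff are made up for illustration; a real dump is compressed and far too large to hold in memory, which is why this streams page by page:

```python
import io
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def local(tag):
    """Strip the XML namespace, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def parse_ts(text):
    # Dump timestamps look like 2022-11-01T12:00:00Z (UTC).
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def latest_before(stream, cutoff):
    """Yield (title, timestamp, text) for each page's last revision at or before cutoff."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, best = None, None
        for child in elem:
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "revision":
                ts, text = None, ""
                for field in child:
                    if local(field.tag) == "timestamp":
                        ts = parse_ts(field.text)
                    elif local(field.tag) == "text":
                        text = field.text or ""
                # Keep the newest revision that is not after the cutoff.
                if ts is not None and ts <= cutoff and (best is None or ts > best[0]):
                    best = (ts, text)
        if best is not None:
            yield title, best[0], best[1]
        elem.clear()  # drop this page's children once it has been processed


# Hypothetical two-revision sample standing in for a real dump file.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><timestamp>2021-05-01T00:00:00Z</timestamp><text>old text</text></revision>
    <revision><timestamp>2023-05-01T00:00:00Z</timestamp><text>new text</text></revision>
  </page>
</mediawiki>"""

cutoff = datetime(2022, 11, 30, tzinfo=timezone.utc)  # arbitrary example epoch
results = list(latest_before(io.BytesIO(SAMPLE.encode("utf-8")), cutoff))
print(results)
```

Pages created after the cutoff have no qualifying revision and are simply skipped, which is probably what you want for a "pre-epoch" snapshot.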
You can download a full archive already.

edit, link: https://en.wikipedia.org/wiki/Wikipedia:Database_download

Without article history and videos, it's small enough that many modern smartphones can have a local offline copy.

http://kiwix.org/

I'm unsure if this will happen. There's plenty of checks-and-balances for Wikipedia edits. There's automated spam detection, editors manually looking over edits for articles on their watchlist, editors who look over subtopics, and even editors that take a look at the general stream of edits. It's already possible to flag mass edits. As for whether ChatGPT will inflect the subtle tone and bias of edits made using it, that's the same as bias from human users. And the same mechanisms for dealing with human bias apply here.

In terms of practical utility, for the vast majority of humanity, access to translated articles in their local language is the biggest problem, I think. There is no Yoruba-language Wiki article on General Relativity, for example. Second come entire biased communities: some of the smaller Wikis are full of far-right editors, and most editors (like 90%) are men.

I can see AI bots submitting convincing edits at random times in no particular pattern. Eventually they will overwhelm Wikipedia checks and balances.
>> I wish I had the time or facility to take a snapshot of Wikipedia now, before the imminent deluge of ChatGPT-based updates starts materially modifying Wikipedia in some weird and unpredictable manner.

> I'm unsure if this will happen. There's plenty of checks-and-balances for Wikipedia edits.

I think it will. It's so tedious to edit Wikipedia (due to bureaucracy and internal politics) that their editorial population is in a long-term decline, which means their oversight ability is declining too.

Probably what will happen is LLM-generated content will creep into long-tail articles, then work its way into more "medium-profile" articles as editors get exhausted. The extremely high-profile stuff (e.g. New York City), political battleground articles (e.g. Donald Trump), and areas patrolled by obsessives (railroads, Pokemon) will probably remain unaffected by the corruption the longest. At some point, the only way to resist will be to become much more hostile to new editors, but that's also long-term suicide for the project.

I think they're painted into a corner.

I mean, maybe. AI on the "good side" will also improve. It should be possible to check a sentence against its reference with LLMs. And anything not sourced is suspect, just as it is now.

I also don't like the attitude of Wikipedia being "them", as in "their editorial population". It's our public good, like our air, and everyone should care to ensure its high quality. If you see a problem in the world, you have to try to fix it, instead of sitting on the sidelines, looking from the outside in.

"As of 2 July 2023, the size of the current version of all articles compressed is about 22.14 GB without media." - https://en.m.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
The Wikipedia community has generally been pretty resistant to allowing fully AI-based tools in. We've had tools such as Lsjbot (https://en.wikipedia.org/wiki/Lsjbot) in the past, but they've failed to gain community consent on any of the large Wikipedias. If someone tries to bring an LLM-based tool to Wikipedia, it would take a lot of finesse to have any shot of the community allowing it.
I don't think it takes much finesse to just randomly start "improving" articles using the output of an LLM. It only takes a single well meaning yet misguided person. Remember this? https://www.theguardian.com/uk-news/2020/aug/26/shock-an-aw-...
Yeah, definitely a potential problem on the smaller language Wikipedias as the Scots Wikipedia incident shows, but for the big ones, low-quality content from new editors is not really a new problem to deal with.
But what about the whole mass of tech bros who don't understand what LLMs are (random text generators and nothing more), and manually start to add changes? It's a virus polluting every industry.
Wikipedia dumps are publicly available, both from Wikipedia itself and from the Internet Archive.

There’s no “time or facility” constraint, only storage space.

The wikipedia politburo already makes it impossible for normies to edit any wikipedia article worth editing. If you don't believe me, try it out with a stopwatch to see how long it takes for your edit to be reverted.
That you call them a 'politburo' and refer to 'normies' gives an indication that the types of edits you were making were neither well sourced nor neutral.

I've never had an edit reverted on Wikipedia.

> the types of edits you were making were neither well sourced nor neutral.

There are a lot of such edits on Wikipedia (neither well sourced nor neutral). For some reason, a certain bias passes through the filter.

You can torrent a copy of Wikipedia, including article history. Locally, you can go back to any revision of any article you want. I keep a copy locally just because it seems something valuable to have.
You can use Kiwix too as an easy way to get an archive of it.