| HN Mirror


	by charlieo88 1075 days ago
	I wish I had the time or facility to take a snapshot of Wikipedia now, before the imminent deluge of ChatGPT-based updates starts materially modifying Wikipedia in some weird and unpredictable manner.

12 comments

LeoPanthera 1075 days ago

In late 2021 / early 2022 I got scared about the incoming consequences of LLMs and downloaded all the "Kiwix" archives I could find, including Wikipedia, a bunch of other Wikimedia sites, Stack Overflow, etc.

I'm pretty glad that I did. I'm going to hold onto them indefinitely. They have become the "low background steel" of text.

samwillis 1075 days ago

I really like that analogy.

For anyone curious what low background steel is, it's steel that was made before the first atomic bombs were tested: https://en.m.wikipedia.org/wiki/Low-background_steel

tivert 1075 days ago

> In late 2021 / early 2022 I got scared about the incoming consequences of LLMs and downloaded all the "Kiwix" archives I could find, including Wikipedia, a bunch of other Wikimedia sites, Stack Overflow, etc.

> I'm pretty glad that I did. I'm going to hold onto them indefinitely. They have become the "low background steel" of text.

Also, ironically, the Pushshift Reddit dumps from before they were taken down (still available via torrent). The exact time Reddit shut down the API to sell their data for AI training is also exactly the time it started to become less valuable for that.

I believe a lot of subreddits started implementing protest moderation policies after reddit came down on the blackout. IMHO, they should implement rules like "no posts unless it's a ChatGPT hallucination."

bitcoinmoney 1074 days ago

Link to the torrent, for science?

speedgoose 1075 days ago

Wikipedia doesn’t remove the old versions.

Otherwise you can find an archive there: https://archive.org/details/wikimediadownloads?and%5B%5D=sub...

hughesjj 1075 days ago

Wikipedia has released snapshots available for download for over a decade now including ones with full edit histories, meaning you can just revert all edits to before a chosen epoch.
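
A minimal sketch of that "revert to a chosen epoch" idea, assuming a full-history dump in the MediaWiki XML export format (`<page>` elements containing a `<title>` and `<revision>` children, each with a `<timestamp>` and `<text>`). The function name `revisions_before` and the sample data are hypothetical, made up for illustration, and this is not an official Wikimedia tool:

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def revisions_before(xml_stream, cutoff):
    """Yield (title, timestamp, text) for each page's newest revision at or before cutoff.

    cutoff is an ISO 8601 UTC string such as '2022-11-30T00:00:00Z'. Timestamps
    in that fixed format compare correctly as plain strings.
    """
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, best = None, None
        for child in elem:
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "revision":
                fields = {local(f.tag): f.text for f in child}
                ts = fields.get("timestamp")
                if ts and ts <= cutoff and (best is None or ts > best[0]):
                    best = (ts, fields.get("text") or "")
        if best is not None:  # pages created after the cutoff are skipped
            yield title, best[0], best[1]
        elem.clear()  # drop the page's revisions once it has been processed


# Tiny hand-written sample standing in for a real dump (assumption).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><timestamp>2022-01-01T00:00:00Z</timestamp><text>old</text></revision>
    <revision><timestamp>2023-06-01T00:00:00Z</timestamp><text>new</text></revision>
  </page>
</mediawiki>"""

result = list(revisions_before(io.BytesIO(SAMPLE.encode()), "2022-11-30T00:00:00Z"))
print(result)  # [('Example', '2022-01-01T00:00:00Z', 'old')]
```

Real dumps are compressed and far larger, so the stream would more likely come from something like `bz2.open(path, "rb")`. This sketch still holds one page's whole history in memory at a time, which may be too much for heavily edited articles.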

bombela 1075 days ago

You can download a full archive already.

edit, link: https://en.wikipedia.org/wiki/Wikipedia:Database_download

int_19h 1075 days ago

Without article history and videos, it's small enough that many modern smartphones can have a local offline copy.

http://kiwix.org/

pradn 1075 days ago

I'm unsure if this will happen. There's plenty of checks-and-balances for Wikipedia edits. There's automated spam detection, editors manually looking over edits for articles on their watchlist, editors who look over subtopics, and even editors that take a look at the general stream of edits. It's already possible to flag mass edits. As for whether ChatGPT will inflect the subtle tone and bias of edits made using it, that's the same as bias from human users. And the same mechanisms for dealing with human bias apply here.

In terms of practical utility, for the vast majority of humanity, access to translated articles in their local language is the biggest problem, I think. There is no Yoruba-language Wiki article on General Relativity, for example. Second comes entire biased communities - like some of the smaller Wikis are full of far-right editors, and most editors (like 90%) are men.

worrycue 1075 days ago

I can see AI bots submitting convincing edits at random times in no particular pattern. Eventually they will overwhelm Wikipedia checks and balances.

tivert 1075 days ago

>> I wish I had the time or facility to take a snapshot of Wikipedia now, before the imminent deluge of ChatGPT-based updates starts materially modifying Wikipedia in some weird and unpredictable manner.

> I'm unsure if this will happen. There's plenty of checks-and-balances for Wikipedia edits.

I think it will. It's so tedious to edit Wikipedia (due to bureaucracy and internal politics) that their editorial population is in a long-term decline, which means their oversight ability is declining too.

Probably what will happen is LLM-generated content will creep into long-tail articles, then work its way into more "medium-profile" articles as editors get exhausted. The extremely high-profile stuff (e.g. New York City), political battleground articles (e.g. Donald Trump), and areas patrolled by obsessives (railroads, Pokemon) will probably remain unaffected by the corruption the longest. At some point, the only way to resist will be to become much more hostile to new editors, but that's also long-term suicide for the project.

I think they're painted into a corner.

pradn 1074 days ago

I mean, maybe. AI on the "good side" will also improve. It should be possible to check a sentence against its reference with LLMs. And anything not sourced is suspect, just as it is now.

I also don't like the attitude of Wikipedia being "them", as in "their editorial population". It's our public good, like our air, and everyone should care to ensure its high quality. If you see a problem in the world, you have to try to fix it, instead of sitting on the sidelines, looking from the outside in.

deepserket 1075 days ago

"As of 2 July 2023, the size of the current version of all articles compressed is about 22.14 GB without media." - https://en.m.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia

hayleox 1075 days ago

The Wikipedia community has generally been pretty resistant to allowing fully AI-based tools in. We've had tools such as Lsjbot (https://en.wikipedia.org/wiki/Lsjbot) in the past, but they've failed to gain community consent on any of the large Wikipedias. If someone tries to bring an LLM-based tool to Wikipedia, it would take a lot of finesse to have any shot of the community allowing it.

bruce343434 1075 days ago

I don't think it takes much finesse to just randomly start "improving" articles using the output of an LLM. It only takes a single well-meaning yet misguided person. Remember this? https://www.theguardian.com/uk-news/2020/aug/26/shock-an-aw-...

hayleox 1075 days ago

Yeah, definitely a potential problem on the smaller language Wikipedias as the Scots Wikipedia incident shows, but for the big ones, low-quality content from new editors is not really a new problem to deal with.

1270018080 1075 days ago

But what about the whole mass of tech bros who don't understand what LLMs are (random text generators and nothing more), and manually start to add changes? It's a virus polluting every industry.

masklinn 1075 days ago

Wikipedia dumps are publicly available, both from Wikimedia itself and from the Internet Archive.

There’s no “time or facility” constraint, only storage space.

Der_Einzige 1075 days ago

The wikipedia politburo already makes it impossible for normies to edit any wikipedia article worth editing. If you don't believe me, try it out with a stopwatch to see how long it takes for your edit to be reverted.

_djo_ 1075 days ago

That you call them a 'politburo' and refer to 'normies' gives an indication that the types of edits you were making were neither well sourced nor neutral.

I've never had an edit reverted on Wikipedia.

hulitu 1074 days ago

> the types of edits you were making were neither well sourced nor neutral.

There are a lot of such edits on Wikipedia (neither well sourced nor neutral). For some reason, a certain bias passes through the filter.

cheald 1075 days ago

You can torrent a copy of Wikipedia, including article history. Locally, you can go back to any revision of any article you want. I keep a copy locally just because it seems like something valuable to have.

ravetcofx 1075 days ago

You can use Kiwix too, as an easy way to get an archive of it.
