by Philpax
413 days ago
Yeah, it's a bit confusing at first to navigate. Luckily, they offer XML dumps that aren't too bad to work with:
1. Go to https://dumps.wikimedia.org/enwiki/latest/ (or a date of your choice in /enwiki)
2. Download https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page... and https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.... The first file is a bz2-multistream-compressed dump of an XML file containing all of English Wikipedia's text, while the second file is an index to make it easier to find specific articles.
3. You can either:
a. unpack the first file
b. use the second file to locate specific articles within the first file; it maps page title -> file offset for the relevant bz2 stream
c. use a streaming decoder to process the entire Wiki without ever decompressing it wholly
4. Once you have the XML, getting at the actual text isn't too difficult; you should use a streaming XML decoder to avoid as much allocation as possible when processing this much data.
The XML contains pages like this:
<page>
<title>AccessibleComputing</title>
<ns>0</ns>
<id>10</id>
<redirect title="Computer accessibility" />
<revision>
<id>1219062925</id>
<parentid>1219062840</parentid>
<timestamp>2024-04-15T14:38:04Z</timestamp>
<contributor>
<username>Asparagusus</username>
<id>43603280</id>
</contributor>
<comment>Restored revision 1002250816 by [[Special:Contributions/Elli|Elli]] ([[User talk:Elli|talk]]): Unexplained redirect breaking</comment>
<origin>1219062925</origin>
<model>wikitext</model>
<format>text/x-wiki</format>
<text bytes="111" sha1="kmysdltgexdwkv2xsml3j44jb56dxvn" xml:space="preserve">#REDIRECT [[Computer accessibility]]
{{rcat shell|
{{R from move}}
{{R from CamelCase}}
{{R unprintworthy}}
}}</text>
<sha1>kmysdltgexdwkv2xsml3j44jb56dxvn</sha1>
</revision>
</page>
so all you need to do is get at the `text`.
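For what it's worth, here is a rough Python sketch of options 3b and 3c plus step 4, using only the standard library. Treat it as an illustration rather than a reference implementation: the function names are made up for this example, and the index line layout `offset:page_id:title` is my understanding of the index file, so check it against a real line before relying on it.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_file):
    """Yield (title, text) per <page>, parsing incrementally (step 4)."""
    title, text = None, None
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children to limit memory use


def parse_index_line(line):
    """Split one index line; ASSUMES the layout 'offset:page_id:title'."""
    # maxsplit=2 because titles can themselves contain colons.
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title


def read_stream(dump_file, offset):
    """Decompress only the single bz2 stream starting at `offset` (option 3b)."""
    dump_file.seek(offset)
    decomp = bz2.BZ2Decompressor()
    chunks = []
    while not decomp.eof:
        data = dump_file.read(1 << 16)
        if not data:
            break
        chunks.append(decomp.decompress(data))
    return b"".join(chunks)


def pages_in_stream(dump_file, offset):
    """Yield (title, text) for the pages stored in one stream."""
    # A stream holds bare <page> elements with no root, so wrap them in a
    # dummy one. This does not apply to the very first stream, which holds
    # the dump header instead of pages.
    fragment = read_stream(dump_file, offset)
    return iter_pages(io.BytesIO(b"<pages>" + fragment + b"</pages>"))
```

For option 3c, `bz2.open(dump_path, "rb")` reads concatenated streams transparently, so `for title, text in iter_pages(bz2.open(dump_path, "rb")): ...` walks the whole dump without unpacking it to disk. `dump_path` here is just a placeholder for wherever you saved the first file.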
I know there are now a couple of pretty good wikitext parsers, but for years it was a bigger problem. The only "official" one was the huge PHP app itself.