by Philpax 413 days ago

Yeah, it's a bit confusing at first to navigate. Luckily, they offer XML dumps that aren't too bad to work with:

1. Go to https://dumps.wikimedia.org/enwiki/latest/ (or a date of your choice in /enwiki)

2. Download https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page... and https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.... The first file is a multistream bz2-compressed XML dump containing all of English Wikipedia's text, while the second file is an index that makes it easier to find specific articles.

3. You can either:

  a. unpack the first file
  b. use the second file to locate specific articles within the first file; it maps page title -> file offset for the relevant bz2 stream
  c. use a streaming decoder to process the entire Wiki without ever decompressing it wholly
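
As a rough sketch of option (b) in Python: my understanding is that each line of the index looks like `offset:page_id:title`, where `offset` is the byte position of a bz2 stream inside the dump, so that is an assumption here. `parse_index_line` and `read_stream` are hypothetical helper names, not part of any library. The idea is to seek to the offset and decompress exactly one stream:

```python
import bz2


def parse_index_line(line: str) -> tuple[int, int, str]:
    """Split one index line, assumed to be 'offset:page_id:title'.

    Titles can contain colons themselves, so split at most twice.
    """
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title


def read_stream(dump_path: str, offset: int, chunk_size: int = 256 * 1024) -> str:
    """Decompress the single bz2 stream that starts at `offset` in the dump."""
    decompressor = bz2.BZ2Decompressor()
    chunks = []
    with open(dump_path, "rb") as dump:
        dump.seek(offset)
        # Feed compressed bytes until this one stream ends; anything after
        # the end of the stream is left in decompressor.unused_data.
        while not decompressor.eof:
            block = dump.read(chunk_size)
            if not block:
                break  # truncated file
            chunks.append(decompressor.decompress(block))
    return b"".join(chunks).decode("utf-8")
```

The string that comes back is an XML fragment (a batch of `<page>` elements with no root element), so you still have to find the page with the title you wanted inside it.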

4. Once you have the XML, getting at the actual text isn't too difficult; you should use a streaming XML decoder to avoid as much allocation as possible when processing this much data.

The XML contains pages like this:

    <page>
      <title>AccessibleComputing</title>
      <ns>0</ns>
      <id>10</id>
      <redirect title="Computer accessibility" />
      <revision>
        <id>1219062925</id>
        <parentid>1219062840</parentid>
        <timestamp>2024-04-15T14:38:04Z</timestamp>
        <contributor>
          <username>Asparagusus</username>
          <id>43603280</id>
        </contributor>
        <comment>Restored revision 1002250816 by [[Special:Contributions/Elli|Elli]] ([[User talk:Elli|talk]]): Unexplained redirect breaking</comment>
        <origin>1219062925</origin>
        <model>wikitext</model>
        <format>text/x-wiki</format>
        <text bytes="111" sha1="kmysdltgexdwkv2xsml3j44jb56dxvn" xml:space="preserve">#REDIRECT [[Computer accessibility]]

    {{rcat shell|
    {{R from move}}
    {{R from CamelCase}}
    {{R unprintworthy}}
    }}</text>
        <sha1>kmysdltgexdwkv2xsml3j44jb56dxvn</sha1>
      </revision>
    </page>

so all you need to do is get at the `text`.
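
For what it's worth, here is a minimal sketch of that in Python using only the standard library. `iter_pages` and `local_name` are hypothetical names of mine, and I'm assuming the real dump wraps the pages in a single root element that may carry an XML namespace, which is why the tag names get stripped down to their local part:

```python
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix from an element tag, if there is one."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream: BinaryIO) -> Iterator[tuple[str, str]]:
    """Yield (title, wikitext) for each <page>, without building the whole tree."""
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    title, text = "", ""
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = "", ""
            root.clear()  # drop finished pages so memory stays flat


# Usage sketch (dump_path is a placeholder for wherever you saved the dump):
#
#     import bz2
#     with bz2.open(dump_path, "rb") as f:
#         for title, text in iter_pages(f):
#             ...
```

`bz2.open` reads multistream files transparently, so this also covers option (c): the dump is decompressed and parsed incrementally and never has to exist unpacked on disk.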

ks2048 413 days ago

The bigger problem is that this is wikitext markup. It would be helpful if they also provided HTML and/or plain text.

I know there are now a couple of pretty good wikitext parsers, but for years this was a bigger problem. The only "official" one was the huge PHP app itself.

Philpax 413 days ago

Oh, it's godawful; the format is a crime against all things structured. I use `parse-wiki-text-2` [0], which is a fork of `parse-wiki-text`, a Rust library by an author who has now disappeared into the wind. (Every day that I parse Wikipedia, I thank him for his contributions, wherever he may be.)

I wrote another Rust library [1] that wraps `parse-wiki-text-2` and offers a simplified AST that takes care of matching tags for you. It's designed to be bound to WASM [2], which is how I'm pretty reliably parsing wikitext for my web application. (The existing JS libraries aren't fantastic, if I'm being honest.)

[0]: https://github.com/soerenmeier/parse-wiki-text-2

[1]: https://github.com/philpax/wikitext_simplified

[2]: https://github.com/genresinspace/genresinspace.github.io/blo...
