by 0cf8612b2e1e 828 days ago
I built one of these myself that I keep on my laptop. Never had a real need to use it, but glad I have it.

I keep meaning to do the same thing with Wikipedia, although the Wikipedia dumps are so inscrutably named and seemingly undocumented that it seems the organization does not want me to pursue the idea.


I've had the same problem with Fandom née Wikia dumps. Just gigabytes of XML with questionable adherence to schemas. Fandom also has a ton of custom-to-Fandom tags which are a further pain to handle.

Pulling useful content out of the dumps has been an exercise in frustration. I'm sure I could figure something out if I had a bunch of time to dedicate to the effort.

If I just had sqlite dumps they'd be trivial to work with and I'd be much happier with them.
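For what it's worth, here is a rough sketch of the XML-to-SQLite step, assuming the dump follows the standard MediaWiki export layout (`<page>` elements containing a `<title>` and a revision `<text>`). The `dump_to_sqlite` helper, the `pages` table, and the sample document are all made up for illustration. It ignores Fandom's custom tags entirely and stores raw wikitext, so it is a starting point rather than a full converter.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace, since the export schema version varies by dump."""
    return tag.rsplit("}", 1)[-1]


def dump_to_sqlite(source, conn):
    """Stream <page> elements from an export file into a hypothetical `pages` table."""
    conn.execute("CREATE TABLE IF NOT EXISTS pages (title TEXT, text TEXT)")
    count = 0
    # iterparse reads incrementally, so the whole dump never sits in memory at once
    for _, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                # with several revisions per page, the last one seen wins
                text = child.text or ""
        conn.execute("INSERT INTO pages VALUES (?, ?)", (title, text))
        elem.clear()  # drop the page's children once it has been stored
        count += 1
    conn.commit()
    return count


# Tiny made-up sample standing in for a real multi-gigabyte dump
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>Hello [[wiki]]</text></revision>
  </page>
</mediawiki>"""

conn = sqlite3.connect(":memory:")
n = dump_to_sqlite(io.BytesIO(SAMPLE), conn)
rows = conn.execute("SELECT title, text FROM pages").fetchall()
```

For a real dump you would pass a file path (or an opened `bz2`/`gzip` file object) instead of the `BytesIO` sample, and a file-backed database instead of `:memory:`. Matching on the local tag name rather than a fixed namespace is a deliberate choice here, since adherence to the schema is exactly the part that can't be trusted.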

Those sites are so ad-infested, I am amazed they offer dumps to get the content. Now I am similarly interested in pursuing this idea, but possibly with the exact same amount of tolerance for pain that you have reported.

The ad cancer was part of my original motivation for downloading dumps. I've now found that a lot of Fandom wikis link to dumps you can't actually download, because they're on S3 buckets that require keys or have download limits. It's infuriating, and I think maybe I'm lucky I grabbed a few dumps when I did.

Fandom is usually the first example I think of whenever I hear the word "enshittification". First Wikia ate all the independent wikis because they offered free/managed MediaWiki hosting. Then they slowly started making Wikia worse until the full Fandomization. Now the site is literally unusable without an ad blocker and all of that GFDL content on the site is locked behind obfuscation and incompetence. I desperately miss the old Wookieepedia and Memory Alpha.