Hacker News
by giantrobot 831 days ago
I've had the same problem with Fandom née Wikia dumps. Just gigabytes of XML with questionable adherence to schemas. Fandom also has a ton of custom-to-Fandom tags which are a further pain to handle.

Pulling useful content out of the dumps has been an exercise in frustration. I'm sure I could figure something out if I had a bunch of time to dedicate to the effort.

If I just had sqlite dumps they'd be trivial to work with and I'd be much happier with them.
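
For what it's worth, a first pass at getting from the XML to SQLite might look something like the sketch below. It assumes the dump follows the usual MediaWiki export layout (`<page>` containing `<title>`, `<id>`, and one or more `<revision>` elements with a `<text>` child), which a given Fandom dump may not. The function name `dump_to_sqlite` and the `pages` table are made up for illustration. It keeps only the last revision of each page and stores the raw wikitext, so the Fandom-specific tags are still in there to deal with afterwards.

```python
import sqlite3
import xml.etree.ElementTree as ET


def local(tag):
    # ElementTree reports namespaced tags as "{uri}name"; keep only "name".
    return tag.rsplit("}", 1)[-1]


def dump_to_sqlite(xml_source, conn):
    """Stream <page> elements from a MediaWiki-style export into SQLite.

    xml_source is a filename or a binary file object; conn is an open
    sqlite3 connection. Returns the number of pages written.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages "
        "(id INTEGER PRIMARY KEY, title TEXT, text TEXT)"
    )
    count = 0
    # iterparse yields each element once it is fully read, so the whole
    # dump never has to be loaded as a single tree.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        page_id, title, text = None, None, ""
        for child in elem:
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "id" and child.text:
                page_id = int(child.text)
            elif name == "revision":
                # Later revisions overwrite earlier ones, so the last
                # revision listed in the page wins.
                for sub in child:
                    if local(sub.tag) == "text":
                        text = sub.text or ""
        conn.execute(
            "INSERT OR REPLACE INTO pages (id, title, text) VALUES (?, ?, ?)",
            (page_id, title, text),
        )
        count += 1
        elem.clear()  # drop this page's subtree once it has been stored
    conn.commit()
    return count
```

Once the pages are in a table, pulling an article out is a plain `SELECT text FROM pages WHERE title = ?`. The hard part, malformed XML and the custom tags, is exactly what this doesn't solve.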


Those sites are so ad-infested that I'm amazed they offer dumps of the content at all. I'm now interested in pursuing this idea too, though probably with exactly the same tolerance for pain that you've reported.
The ad cancer was part of my original motivation for downloading dumps. I've now found that a lot of Fandom wikis link to dumps you can't actually download, because they're on S3 buckets that require keys or have download limits. It's infuriating, and I think I'm lucky I grabbed a few dumps when I did.

Fandom is usually the first example I think of whenever I hear the word "enshittification". First Wikia ate all the independent wikis because it offered free, managed MediaWiki hosting. Then it slowly started making Wikia worse until the full Fandomization. Now the site is literally unusable without an ad blocker, and all of that GFDL content on the site is locked behind obfuscation and incompetence. I desperately miss the old Wookieepedia and Memory Alpha.