by roenxi
440 days ago
I've written some unfathomably bad web crawlers in the past. Indeed, web crawlers might be the most natural magnet for bad coding and eye-twitchingly questionable architectural practices I know of. While it likely isn't the major factor here, I can attest that there are coders who see pages-articles-multistream.xml.bz2 and then reach for a wget + HTML parser combo. If you don't live and breathe Wikipedia, it is going to soak up a lot of time figuring out Wikipedia's XML format and markup language, not to mention re-learning how to parse XML. HTTP requests and bashing through the HTML are everyday web skills and familiar scripting, more reflexive and better understood. The right way would probably be much easier, but figuring it out will take too long. Although that is all pre-ChatGPT logic. Now I'd start by asking it to solve my problem.
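For what it's worth, the "right way" can be fairly small. Below is a minimal sketch, not a definitive implementation, of streaming pages out of a bz2-compressed MediaWiki XML export using only the Python standard library. The names `local`, `iter_pages`, `demo_pages`, and the tiny `SAMPLE` document are made up for illustration; the sample only mimics the `page`/`title`/`revision`/`text` nesting, and a real dump has many more fields. The resulting text is still raw wiki markup, which this does not attempt to parse.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local(tag):
    # Drop the "{namespace}" prefix so the code does not depend on the
    # export schema version declared in the dump.
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export, one page at a time."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the finished page's children and text


# Hypothetical two-page sample standing in for the real dump.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Alpha</title><revision><text>First '''article'''.</text></revision></page>
  <page><title>Beta</title><revision><text>Second [[article]].</text></revision></page>
</mediawiki>"""


def demo_pages():
    # Compress the sample in memory, then read it back the way a dump file would be read.
    compressed = io.BytesIO(bz2.compress(SAMPLE))
    with bz2.open(compressed, "rb") as stream:
        return list(iter_pages(stream))
```

For an actual dump, the assumption is that you would pass the file path to `bz2.open(path, "rb")` and iterate over `iter_pages(stream)` lazily instead of building a list, since the whole point is to avoid holding everything in memory.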
https://huggingface.co/datasets/wikimedia/wikipedia