by a2128 438 days ago
You don't even need to deal with any XML formats or anything. They publish a complete dataset on Hugging Face that takes just a few lines to load in your Python training script:

https://huggingface.co/datasets/wikimedia/wikipedia
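
For reference, a rough sketch of what those few lines might look like. This assumes the `datasets` library is installed (`pip install datasets`) and that the `20231101.en` snapshot config and the `title`/`text` fields match what the dataset card currently lists, so check the card before relying on them. The helper names `load_wikipedia` and `first_texts` are made up for illustration.

```python
from itertools import islice


def load_wikipedia(config="20231101.en", streaming=True):
    """Open the wikimedia/wikipedia dataset from the Hugging Face Hub.

    The config name is an assumption (a dated snapshot plus language code);
    see the dataset card for the snapshots that actually exist.
    """
    # Imported lazily so the rest of the script works without `datasets`.
    from datasets import load_dataset

    # streaming=True iterates over records without downloading the full dump.
    return load_dataset(
        "wikimedia/wikipedia", config, split="train", streaming=streaming
    )


def first_texts(articles, n=3):
    """Return the `text` field of the first n article records."""
    return [article["text"] for article in islice(articles, n)]
```

Something like `first_texts(load_wikipedia(), 3)` should then hand back the plain text of three articles, ready to feed into a tokenizer. Streaming is used here only to avoid pulling the whole dump up front; drop it if you want the dataset cached locally.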