by Scaevolus 2612 days ago
> Dataset source: wikidump Date: Feb 7, 2019 docs: 5.6M size: 5.3 GB

"wikidump" links to https://dumps.wikimedia.org/enwiki/latest/ , which lists thousands of files, none of which is both around 5 GB and a plausible match for the corpus. That's a very poor corpus link!

It says "Feb 7, 2019", so it probably means https://dumps.wikimedia.org/enwiki/20190120/ or https://dumps.wikimedia.org/enwiki/20190201/ ... maybe. Neither has an obvious 5.3 GB file.
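
For anyone who wants to repeat the check, here is a rough sketch of what I mean by "obvious": filter a directory listing down to files within a few percent of the stated size. The helper name `files_near_size`, the 5% tolerance, and the sample listing are all made up for illustration; you would have to fill in the real names and byte sizes from the dump index yourself.

```python
def files_near_size(files, target_bytes, tolerance=0.05):
    """Return the (name, size) pairs whose size is within `tolerance` of target_bytes."""
    low = target_bytes * (1 - tolerance)
    high = target_bytes * (1 + tolerance)
    return [(name, size) for name, size in files if low <= size <= high]


# Hypothetical listing: these names and sizes are invented, not real dump files.
listing = [
    ("example-a.xml.bz2", 16_000_000_000),
    ("example-b.xml.bz2", 5_400_000_000),
    ("example-c.sql.gz", 1_200_000),
]

# 5.3 GB as quoted, assuming decimal gigabytes.
matches = files_near_size(listing, 5_300_000_000)
print(matches)
```

An empty result for both candidate dates would back up the complaint, though the 5.3 GB figure could also refer to a decompressed or preprocessed file that never appears in the listing.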