by sillysaurusx 1060 days ago
I’d like to distribute training data. (I’m one of the authors of The Pile, which was recently knocked offline when The Eye stopped hosting it due to threats.)

I also have the entire books3 dataset — the original epub files, not text extractions — sitting around on a hard drive. Many people have wanted metadata or to reprocess the set for their own purposes. I’d like to release those, but distributing 190,000 epubs is a little… hard.

Sadly, 50TB of traffic per month is almost nothing when it comes to distributing 800GB datasets. I’d spend 150 euros a month for a solution, but it’d need to be heftier.
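
For a rough sense of scale, here is a minimal sketch of the arithmetic. It assumes decimal units (1 TB = 1000 GB) and that every downloader pulls the full 800GB exactly once; the function name is only illustrative.

```python
# Back-of-the-envelope check. Decimal units (1 TB = 1000 GB) are an assumption.
MONTHLY_TRAFFIC_GB = 50 * 1000  # the 50TB/month traffic figure
DATASET_GB = 800                # one full copy of the dataset


def full_downloads_per_month(traffic_gb: float, dataset_gb: float) -> float:
    """Number of complete copies a monthly traffic budget can serve."""
    return traffic_gb / dataset_gb


copies = full_downloads_per_month(MONTHLY_TRAFFIC_GB, DATASET_GB)
print(f"{copies:.1f} full downloads per month")
```

Under those assumptions the budget covers only a few dozen complete downloads a month, about two per day.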

2 comments

Just rent a seedbox. Example: https://pulsedmedia.com/
What about P2P, like a torrent? I think plenty of people would be open to seeding this. (Personally, I would love to see things like this.)