by sillysaurusx
1060 days ago
I’d like to distribute training data. (I’m one of the authors of The Pile, which was recently knocked offline when The Eye stopped hosting it due to threats.) I also have the entire books3 dataset (the original epub files, not text extractions) sitting around on a hard drive. Many people have wanted metadata or to reprocess the set for their own purposes. I’d like to release those, but distributing 190,000 epubs is a little… hard. Sadly 50TB of traffic per month is almost nothing when it comes to distributing 800GB datasets. I’d spend 150 euros a month for a solution, but it’d need to be heftier.