
Unfortunately, that's probably my fault.

I foolishly had a big head, and felt like it was so clear what needed to happen: we needed a dataset of "every book ever."

books3, one of the largest components of The Pile, is 196,640 books. https://twitter.com/theshawwn/status/1320282149329784833?lan...

I'm proud I did that. And I'm also horrified that my perspective was so incredibly off-base. I get it now. I was blinded by my own thick skull.

The sheer quantity of knowledge in books3 is almost unfathomable. I find it hard to think too much about it, because you end up concluding that AIs are the only entities on earth that stand a chance of absorbing this much knowledge.

I just pulled up the books3 index of "2" -- i.e. all books starting with the number 2: https://gist.github.com/shawwn/85cbaf53cb6bb57c49f1688e70532...

That's the truncated file. If you go to the full file, then command-F for "sex", there are 93 hits.

93 sex books. In just the "2" section.
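A rough sketch of that command-F count in script form, assuming the index is plain text with one title per line. The function names and the sample titles below are made up for illustration and are not taken from the real index. Note that a raw substring count also catches words like "Sussex", which is one reason a hit count from titles alone is only a crude signal.

```python
def count_hits(titles, term):
    """Count case-insensitive occurrences of `term` across all titles,
    which is roughly what a browser's find-in-page reports."""
    needle = term.lower()
    return sum(title.lower().count(needle) for title in titles)


def matching_titles(titles, term):
    """Return the titles that contain `term` at least once (case-insensitive)."""
    needle = term.lower()
    return [title for title in titles if needle in title.lower()]


# Hypothetical sample titles, invented for this example.
sample = [
    "2001 - A Space Odyssey.epub.txt",
    "20 Ways to Better Sex.epub.txt",
    "21 Lessons in Sussex Gardening.epub.txt",
]

print(count_hits(sample, "sex"))
print(matching_titles(sample, "sex"))
```

To run it against a real index, read the file into a list of lines and pass that list in place of `sample`.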

All the sections are here: http://the-eye.eu/public/Books/Bibliotik/

Like Hattori Hanzo, I feel I can say with no ego that books3 is my finest work. https://www.youtube.com/watch?v=az2dSNXRKOc&ab_channel=kurts...

You would not believe how hard it is to get 196 thousand books converted into perfectly readable markdown. Even the software books have perfect formatting -- every table, every code snippet, I annihilated every corner case I could find. Because it needed to be perfect for humans, to have any chance of being perfect for AI.

But I was a fool. My ego blinded me to the fact that it's a bad idea to do what I truly believed was in everyone's best interest: that "because any human could read any of those books, AI should know all of those books."

It's not a human. It's a Markov chain. Having it autocomplete sex books is a bad idea for business purposes. I wanted The Pile to be business-grade. My work here has endangered that goal.

And I don't know how it could have ended up any differently. Because I don't know how to sort 196 thousand books into reasonable selections that you may or may not want to exclude. Our goal with The Pile was to let you decide. Who among us would dare feel that they could judge 196 thousand books from their titles alone?

It's a job for filtering and heuristics and analysis and hard work -- none of which I did. I spent around three days turning Aaron Swartz' html2text library into the best damn "epub to training data converter" ever made. Yet my accomplishments feel so hollow, for the reasons you observed here.

Stella and Leo put so much more thought and care into their contributions. I try to take solace in the fact that The Pile lets you pick and choose which portions of training data you want to use: https://github.com/EleutherAI/the-pile
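A minimal sketch of what that picking and choosing could look like, assuming each record is one JSON line with its component name stored under `meta.pile_set_name`. The field name, the "Books3" label, and the `keep_components` helper are assumptions for illustration; check them against the files you actually download.

```python
import json


def keep_components(jsonl_lines, excluded):
    """Yield parsed records whose component name is not in `excluded`.

    Assumes one JSON object per line with the component name at
    record["meta"]["pile_set_name"] (an assumption, not a confirmed schema).
    """
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        name = record.get("meta", {}).get("pile_set_name")
        if name not in excluded:
            yield record


# Hypothetical records, invented for this example.
sample = [
    json.dumps({"text": "def f(): pass", "meta": {"pile_set_name": "Github"}}),
    json.dumps({"text": "Chapter 1 ...", "meta": {"pile_set_name": "Books3"}}),
    json.dumps({"text": "Abstract ...", "meta": {"pile_set_name": "ArXiv"}}),
]

kept = list(keep_components(sample, excluded={"Books3"}))
print([record["meta"]["pile_set_name"] for record in kept])
```

Because the function is a generator, it can be fed an open file handle directly, so a large shard never has to fit in memory at once.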

But of course, the irony is, even though The Pile is so flexible and modular, most people will just use the defaults. And by default, The Pile includes... most of humanity's knowledge. A gargantuan pile of books. So many books that you could fill an entire neighborhood with nothing but books, and you'd still have a hundred thousand books left over.

I don't know how to feel about all that. I wanted to make an impact. I guess I did. Time will tell whether it's a net gain.

Luckily, OpenAI made these same mistakes. That's the grain of truth I cling to. They almost certainly made these exact same mistakes, because their goal was to make a million dollars a year (which they achieved), and to do so as quickly as possible.

Now they have to be super paranoid with their filters, and GPT-J is at least slightly less shocking than GPT-3 thanks to everyone not-me who worked on The Pile.