| HN Mirror


by sillysaurusx 1814 days ago

They super earned it. From day one, everyone showed up with a level of drive and determination I haven't seen elsewhere.

My name is on The Pile paper https://arxiv.org/abs/2101.00027 but I didn't do anything except make the books3 dataset. Stella, Leo, and everyone else did the hard work. You know, the work that's "actually useful to the scientific community." I didn't even help them hunt for typos, even though Stella asked me to. I was just like, sorry, no time, I have to focus on my own research.

Imagine saying "nah" to helping shape one of the most important open source AI research projects of the coming years. Training data quality is becoming more and more of a focus.

Lemme tell you a quick story.

When https://venturebeat.com/2021/06/09/eleutherai-claims-new-nlp... came out, this quote caught my eye:

> But EleutherAI claims to have performed “extensive bias analysis” on The Pile and made “tough editorial decisions” to exclude datasets they felt were “unacceptably negatively biased” toward certain groups or views.

When I read this, I felt astonished that Eleuther was yet again trying to pose as the cool super-progressive AI lab. To my knowledge, no such thing ever happened. And I was involved with The Pile back when it was just me and Leo memeing in Discord DMs about how the world needed some quality training data once and for all.

I went to Stella in DMs (you should follow her too! https://twitter.com/BlancheMinerva/status/139408950872390042...) and was like, what the hell? I don't understand how this could possibly be true. What are these supposed "tough editorial decisions"?

Stella calmly explained to me that the US Congressional Record had been considered and rejected for inclusion in The Pile. I thought "Big deal, who the hell cares?" while saying "Okay, but I don't know what that is."

It’s a written record of all statements made in the US legislature. It was also somewhere between 1GB and 15GB, which would have been a significant portion of The Pile's total size.

I'm going to quote from her private DMs with me, which I haven't asked for permission to do. So this is technically another bad move by me. But she put it so perfectly, I was stunned:

> For half the history of the US, black people were slaves. For something like 75% of it, black people didn’t have the right to vote. A modern reader didn’t think there wasn’t a high proportion of extremely racist content, that would primarily be an inditement of modern people lol.

> The reason we first looked at it was that we included a similar document for the EU Parlement

It took me a few minutes to come to my senses, but I finally realized:

(a) this dataset likely contained a huge proportion of content that, politics aside, would be a Very Bad Idea to include in your ML models by default;

(b) Eleuther had just been trying to do good work this whole time

So you know, when you're in that situation, you can choose to either keep believing your own false ideas, or you can pay attention to empirical evidence and change your behavior. And empirically, I had been a massive asshole to everyone since pretty much the beginning. The only thing I helped with was books3 and arranging The Eye to get them some reliable hosting. (Shoutout to The Eye, by the way. Help 'em out if you can: https://the-eye.eu/public/AI/)

And there's my name, right there on the paper.

It's even worse than I described. I put the paper in jeopardy, because they were submitting it to a conference with strict anonymity rules. I had no idea about it (no one told me). I ended up so happy to see my name on a real arxiv paper that I tweeted out some self-congratulatory bullshit, and quote-tweeted something linking to The Pile. It was a few days into the anonymity period, but nonetheless, it was a violation of the anonymity rules. A lot of people saw that tweet, and the whole point of the rules is to ensure that people don't get unfair advantages by advertising on social media.

When they came to me in DMs apologizing profusely for not talking with me about it, and asking me to delete the tweet, I basically told them to go shove a spoon up their.... because I didn't agree to any rules, and the idea that The Pile should go radio silent for five months on social media struck me as completely crazy.

In hindsight, I was... just awful. So I mean, me posting this is like, the absolute minimum I can do. They've been the ones working for like a year to make all of this happen. I ended up feeling like a fraud, since everyone thinks highly of my ML work, and here I'd been nothing but problematic for a group of people who are just trying to ship good scientific work.

Fast forward to today, and the results are clear. Go help Eleuther: https://www.eleuther.ai/ They're cool, and you'll get a shot at changing the world. I'm not sure you even have to be particularly skilled; some of the most valuable work was done by people who just showed up and started doing things, e.g. making the website look a little nicer, or making a cool logo.

3 comments

reitzensteinm 1814 days ago

This is probably one of the best apologies I've ever read.


ShamelessC 1814 days ago

The quote from the direct message made me respect Eleuther much more. Largely because I had no idea such ethical considerations were even being made.

Understanding the biases of these datasets is clearly more nuanced than I realized and I'm glad Stella had a nuanced understanding here.


sillysaurusx 1814 days ago

Exactly. This was the type of mistake that OpenAI could easily have made. I could see myself including this historical dataset without giving it a second thought. After all, the more data, the better, right?

One of The Pile's goals was to point out how tricky that can be. We've all seen how effortlessly Copilot spits out GPL code by rote; one wrong prompt would be all it takes to start spewing a lot of things that no one wants to hear, if you have the wrong sort of data.

When you train with The Pile, you know exactly what you're getting, because you can take whatever parts you want and ignore the rest. It's a modular dataset. But defaults still matter -- by default, everyone will train on everything. Maybe OpenAI trained on the wrong thing, and maybe that's why they're forcing everyone to use their filters now. Whereas people can "just go train on everything in The Pile" and not have to worry.
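
As a rough sketch of what "take whatever parts you want" could look like: assuming each record is a JSON line with a `text` field and a `meta.pile_set_name` component label, something like the following would keep only the components you choose. The function name `keep_components` and the sample records are made up for illustration; this is not the official tooling.

```python
import json
from typing import Iterable, Iterator, Set


def keep_components(lines: Iterable[str], wanted: Set[str]) -> Iterator[dict]:
    """Yield only the records whose component label is in `wanted`.

    Assumes each line is a JSON object shaped like
    {"text": "...", "meta": {"pile_set_name": "..."}}.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        label = record.get("meta", {}).get("pile_set_name")
        if label in wanted:
            yield record


# Made-up sample records standing in for a real shard.
sample = [
    json.dumps({"text": "a novel", "meta": {"pile_set_name": "Books3"}}),
    json.dumps({"text": "a paper", "meta": {"pile_set_name": "ArXiv"}}),
    json.dumps({"text": "some code", "meta": {"pile_set_name": "Github"}}),
]

kept = list(keep_components(sample, wanted={"ArXiv", "Github"}))
print([r["text"] for r in kept])
```

The point about defaults still stands: a filter like this only helps if someone actually decides to pass a `wanted` set instead of training on everything.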

(Once upon a time, the plan was to include a dump of Literotica in The Pile, which you can still find here: https://the-eye.eu/public/AI/pile_preliminary_components/ I argued heavily in favor of this, and thought it was totally lame when they decided to drop it.

In hindsight, that was a close call. AI Dungeon proves that it's easy to carelessly include things that can bite you later: https://gitgud.io/AuroraPurgatio/aurorapurgatio#aurorapurgat...

Maybe some people want their models to include that sort of thing, but it shouldn't be the default. People shouldn't have to worry that the defaults will be "Whoa, I only wanted to make a Q&A system for my business; why is it reciting love poems?")

Stella saw that, I think. I didn't.


d13 1814 days ago

So what’s the rationale for including so much “romance” literature in The Pile? My innocent “walk in the park” prompt turned extremely graphic for no apparent reason.


sillysaurusx 1814 days ago

Unfortunately, that's probably my fault.

I foolishly had a big head, and felt like it was so clear what needed to happen: we needed a dataset of "every book ever."

books3, one of the largest components of The Pile, is 196,640 books. https://twitter.com/theshawwn/status/1320282149329784833?lan...

I'm proud I did that. And I'm also horrified that my perspective was so incredibly off-base. I get it now. I was blinded by my own thick skull.

The sheer quantity of knowledge in books3 is almost unfathomable. I find it hard to think too much about it, because you end up concluding that AIs are the only entities on earth that stand a chance of absorbing this much knowledge.

I just pulled up the books3 index of "2" -- i.e. all books starting with the number 2: https://gist.github.com/shawwn/85cbaf53cb6bb57c49f1688e70532...

That's the truncated file. If you go to the full file, then command-F for "sex", there are 93 hits.

93 sex books. In just the "2" section.

All the sections are here: http://the-eye.eu/public/Books/Bibliotik/

Like Hattori Hanzo, I feel I can say with no ego that books3 is my finest work. https://www.youtube.com/watch?v=az2dSNXRKOc&ab_channel=kurts...

You would not believe how hard it is to get roughly 196 thousand books converted into perfectly-readable markdown. Even the software books have perfect formatting -- every table, every code snippet, I annihilated every corner case I could find. Because it needed to be perfect for humans, to have any chance of being perfect for AI.

But I was a fool. My ego blinded me to the fact that it's a bad idea to do what I truly believed was in everyone's best interest: that "because any human could read any of those books, AI should know all of those books."

It's not a human. It's a markov chain. Having it autocomplete sex books is a bad idea for business purposes. I wanted The Pile to be business-grade. My work here has endangered that goal.

And I don't know how it could have ended up any differently. Because I don't know how to sort roughly 196 thousand books into reasonable selections that you may or may not want to exclude. Our goal with The Pile was to let you decide. Who among us would dare feel that they could judge roughly 196 thousand books from their titles alone?

It's a job for filtering and heuristics and analysis and hard work -- none of which I did. I spent around three days turning Aaron Swartz' html2text library into the best damn "epub to training data converter" ever made. Yet my accomplishments feel so hollow, for the reasons you observed here.
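
For what it's worth, here is a minimal sketch of the crudest possible first pass at that kind of filtering: splitting a title index by whole-word keyword matches. The function name `split_by_title`, the word list, and the sample titles are all hypothetical, and as noted above, titles alone are nowhere near enough to judge a book. Unlike a plain command-F substring search, the word-boundary match below avoids counting titles like "Sussex Walks".

```python
import re
from typing import Iterable, List, Tuple


def split_by_title(
    titles: Iterable[str], flagged_words: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split titles into (flagged, unflagged) lists.

    A title is flagged when it contains any of `flagged_words` as a
    whole word, ignoring case.
    """
    titles = list(titles)
    words = [w for w in flagged_words if w]
    if not words:
        return [], titles  # nothing to flag on

    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    flagged, unflagged = [], []
    for title in titles:
        (flagged if pattern.search(title) else unflagged).append(title)
    return flagged, unflagged


# Hypothetical titles, not taken from the real index.
sample_titles = [
    "2001 Sex Tips",
    "Sussex Walks",
    "20,000 Leagues Under the Sea",
]

flagged, unflagged = split_by_title(sample_titles, ["sex"])
print(len(flagged), "flagged;", len(unflagged), "kept")
```

A real pass would need to look at the text itself, not just the title, and would still need a human to decide where the line is.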

Stella and Leo put so much more thought and care into their contributions. I try to take solace in the fact that The Pile lets you pick and choose which portions of training data you want to use: https://github.com/EleutherAI/the-pile

But of course, the irony is, even though The Pile is so flexible and modular, most people will just use the defaults. And by default, The Pile includes.... most of humanity's knowledge. A gargantuan pile of books. So many books that you could fill an entire neighborhood with nothing but books, and you'd still have a hundred thousand books left over.

I don't know how to feel about all that. I wanted to make an impact. I guess I did. Time will tell whether it's a net gain.

Luckily, OpenAI made these same mistakes. That's the grain of truth I cling to. They almost certainly made these exact same mistakes, because their goal was to make a million dollars a year (which they achieved), and to do so as quickly as possible.

Now they have to be super paranoid with their filters, and GPT-J is at least slightly less shocking than GPT-3 thanks to everyone not-me who worked on The Pile.


Hendrikto 1814 days ago

> > EleutherAI claims to have performed “extensive bias analysis” on The Pile and made “tough editorial decisions” to exclude datasets they felt were “unacceptably negatively biased” toward certain groups or views.

> When I read this, I felt astonished that Eleuther was yet again trying to pose as the cool super-progressive AI lab.

So they traded biases inherent in the dataset for intentionally introduced biases. Does not sound super progressive to me, to be quite honest.

Focus on your research, do not try to be the morality judge and jury…
