by fancyfredbot 769 days ago
"These datasets, created by former employees who are no longer with OpenAI, were last used in 2021 and deleted due to non-use in 2022."

Deleting the dataset because of non-use sounds completely implausible. It says the dataset is 67B tokens, which is less than 1TB of data. Why would you bother to delete it given it would cost more or less nothing to keep?


If something is a legal liability to keep, "non-use" probably begins about 2 seconds after you finish using it (in their case, once training was finished and only the weights needed keeping).
Exactly. The reason they stopped using the dataset and the reason they deleted it are likely the same: legal liability.

It's obviously not in their interest to state that, but when this is the best alternative explanation they can offer, they might as well have.

Personally I support the use of these books for training AI, but I think this needs to be decided in court and/or with legislation, not swept under the carpet.

>Personally I support the use of these books for training AI

I need a better understanding of Postgres. Can you write me an A-Z book on Postgres? I won't actually buy it from you, just grab the text, train a model, and get the model to answer all the questions I have. But the book would be super helpful, please. Oh, I'm also going to sell a service on this model too... because I like money.

I get that sarcasm isn't exactly a great form of debate, but it felt suitable here. I ABSOLUTELY understand why people don't want AI training on their books.

Thanks, that's a very interesting perspective which I hadn't previously considered.