by concurrentsquar 782 days ago
Reddit may have told OpenAI to pay (probably a lot of) money to legally use Reddit content for training, which is something Reddit is doing with other AI labs (https://www.cbsnews.com/news/google-reddit-60-million-deal-a... ); but GPTBot is not banned under the Reddit robots.txt (https://www.reddit.com/robots.txt).
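A claim like "GPTBot is not banned" can be checked mechanically. Below is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt text in it is a made-up sample for illustration (not Reddit's actual file), and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# This is NOT a copy of Reddit's real robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "SomeOtherBot"):
        verdict = is_allowed(SAMPLE_ROBOTS_TXT, agent, "https://example.com/r/foo")
        print(agent, "allowed" if verdict else "disallowed")
```

To test a live site instead, `RobotFileParser` also offers `set_url()` and `read()`. Either way, the result only says what the file asks for; it is up to the crawler to honor it.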

This is assuming that lmsys' GPT-2 is a retrained GPT-4t or a new GPT-4.5/5, though; I doubt that. One obvious issue: why name it GPT-2 and not something like 'openhermes-llama-3-70b-oai-tokenizer-test' (for maximum discretion) or even 'test language model (please ignore)' (which would work well for marketing)? As a name, GPT-2 doesn't really work well for marketing or privacy, at least compared to the other options.

Lmsys has tested models with weird names before: https://news.ycombinator.com/item?id=40205935


Sam Altman was on the board of reddit until recently. I don't know how these things work in SV but I wouldn't think one would go from 'partly running a company' to 'being charged for something that is probably not enforceable'. It would maybe make sense if they did pay reddit for it, because it isn't Sam's money, anyway, but for reddit to demand payment and then OpenAI to just not use the text data from reddit -- one of the largest sources of good quality conversational training data available -- strikes me as odd. But nothing would surprise me when it comes to this market.
That said, it is pretty SV behavior to have one of your companies pay the other. A subtle wealth transfer from OpenAI/Microsoft to Reddit (and tbh other VC backed flailing companies) would totally make sense.

VC companies for years have been parroting “data is the new oil” while burning VC money like actual oil. Crazy to think that the latest VC backed companies with even more overhyped valuations suddenly need these older ones and the data they’ve hoarded.

> A subtle wealth transfer from OpenAI/Microsoft to Reddit (and tbh other VC backed flailing companies) would totally make sense.

That's the confusing part -- the person I responded to posited that they didn't pay reddit and thus couldn't use the data which is the only scenario that doesn't make sense to me.

I suppose a "data transfer" from Reddit to OAI would be valuable for SamA too? Still a transfer of value from one hand to the other, while others (e.g. Google) have to pay.

That said, I wouldn't be surprised if they pay now. They can't get away with scraping as easily now that they are better-known and commercially incentivized.

Maybe training on whatever this is started before the licensing deal?
robots.txt doesn't really mean anything. I used to work for a company that scraped the web, and this was literally not a concern. That being said, using data for training LLMs is a new thing, and potential lawsuits going Reddit's way are a possibility; we can't really know.

One note: its name is not gpt-2, it is gpt2, which could indicate it's a "second version" of the previous GPT architecture, with gpt-3 and gpt-4 being gpt1-3 and gpt1-4. I am just speculating and am not an expert whatsoever; this could be total bullshit.