by codekansas 1091 days ago
> including personal information obtained without consent

Obtained from (check notes) public internet forums

> For the 16 plaintiffs, the complaint indicates that they used ChatGPT, as well as other internet services like Reddit, and expected that their digital interactions would not be incorporated into an AI model.

You've got to be incredibly naive if you think public Reddit data isn't used to train ML models, not least by Reddit themselves.


Or maybe when you started posting on reddit, LLMs hadn't been invented yet. This is true for 99.9% of the people who post on Reddit.
People have been training ML models on data scraped from Reddit since at least 2015 [1], back when there were fewer than a million users.

[1] https://www.kaggle.com/datasets/ehallmar/reddit-comment-scor...

LLMs were invented at least five years ago (BERT), though you could make the case for a few years earlier. My guess is the majority of Reddit users are new since then, not 0.1%?
Your guess is that the majority of Reddit users have joined since 2018? 1) I do not think that is correct, 2) the mere existence of LLMs doesn't mean the public was aware of how LLMs are trained, and 3) you know exactly what I'm saying and that 99.9% might be slight hyperbole.
1: Reddit has ~1.6B monthly active users, compared to 0.3B in 2018. [1] So 2x user growth seems more likely to me than not.

2: You're the one who went with "invented" ;)

3: I know you're exaggerating, but I think you think you're exaggerating much less than you actually are.

[1] https://www.bankmycell.com/blog/number-of-reddit-users/

> Your guess is that the majority of Reddit users have joined since 2018?

It's not really important to the debate around unlicensed use of copyrighted works to train AI models, but it wouldn't surprise me at all if the majority of Reddit users have joined since 2018. It's tough to get reliable active user counts, but they seem to have risen substantially over the past five years.

It also wouldn't surprise me if the majority of Reddit users were indeed from prior to 2018, but at the very least those who joined after 2018 would be a very substantial minority.

My account(s) are 17 years old on reddit.
Yes? Mine is nearly that old. But we are very clearly the minority!