Hacker News
by dom96 1103 days ago
I've been considering using Reddit data to pre-seed the content in a successor to Reddit. Though I am unsure how that would stand legally.

As a side note, I created an alternative Reddit API[1] and Reddit didn't like that so much they banned my 13 year old Reddit account.

1 - https://api.reddiw.com

6 comments

IANAL. For the US, users grant Reddit a license to use their content when they post it. The users still own that content. Reddit's license does not extend to your reuse of it[0], nor have the underlying users directly granted you permission, so it would not be legal (in the US) for you to reuse it like that.

[0] "you may not... license, sell, transfer, assign, distribute, host, or otherwise commercially exploit the Services or Content" https://www.redditinc.com/policies/user-agreement-september-...

Wouldn't that mean it would be down to the individual users who still own each bit of content to issue a DMCA takedown if they objected?

I imagine the number of such requests would be small.

Ah. The old “I did so much copyright violation it would be infeasible for everyone I took content from to enforce” defence. I see nothing that could go wrong.
Posting that you’re going to be “using Reddit data to pre-seed the content” may make it a bit harder to dodge Reddit in court.
Although prompting “write a comment replying to the text ‘<snip>’ in the style of u/landfe” would yield something uncopyrightable…
I was chatting about this with some friends. If we had a million or so spare, we could just fork Reddit: grab the latest open source version of Reddit, pay the Pushshift guys for the most up-to-date dump they have, and load it in.

Make a system for claiming your old Reddit account. I'm guessing if you try to use OAuth, Reddit will just ban you. So you need to get creative, probably make an extension that grabs the user's session ID from their cookies or something (or let people copy-paste it in if they are technical enough).

Fun to imagine but unfortunately probably won't happen.

No one will use it.
Just launder it through an LLM, problem solved.
Indeed. Could call it something like the RedditCrawl corpus.
You don’t even need Reddit with an LLM. I did some back-of-the-napkin token math, and you can fake a year of activity for a couple thousand dollars (varies by number of users and comment length, of course). Hell, you can even make it look active in real time and respond to real users. As long as you give it some guidance about commenting style (as in, not the default GPT 8th-grade-essay style), it’s very hard to tell.
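
For anyone who wants to redo that kind of napkin math, here is a rough sketch. Every input (user count, comments per user per day, tokens per comment, price per million tokens) is a made-up placeholder, not a figure from the comment above or from any provider's price list, and it only counts generated tokens, not prompt tokens.

```python
def estimate_cost(users, comments_per_user_per_day, tokens_per_comment,
                  price_per_million_tokens, days=365):
    """Return (total_tokens, cost) for generating `days` of synthetic comments.

    All arguments are hypothetical inputs you supply yourself.
    Only output tokens are counted; prompt/context tokens are ignored.
    """
    total_comments = users * comments_per_user_per_day * days
    total_tokens = total_comments * tokens_per_comment
    cost = total_tokens / 1_000_000 * price_per_million_tokens
    return total_tokens, cost


if __name__ == "__main__":
    # Placeholder numbers purely for illustration.
    tokens, cost = estimate_cost(
        users=1_000,
        comments_per_user_per_day=2,
        tokens_per_comment=100,
        price_per_million_tokens=10.0,
    )
    print(f"{tokens:,} tokens, roughly ${cost:,.2f}")
```

The total scales linearly in every input, so doubling the user count or the comment length doubles the estimate, which is the "varies by number of users and comment length" caveat.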
Adversarial interoperability like this would be a great way to neutralise network lock-in effects and create a more level competitive playing field between social media companies. I think we should enshrine protections for this kind of thing.

There was a strong 2019 precedent in favour of allowing this kind of scraping of public content (from LinkedIn in that case): https://www.techdirt.com/2019/09/10/big-news-appeals-court-s...

> As a side note, I created an alternative Reddit API[1] and Reddit didn't like that so much they banned my 13 year old Reddit account.

"I broke Reddit's TOS deliberately and repeatedly and they banned me!" is another way to put it. But it doesn't sound as good and because of the current zeitgeist people will tend to side with you anyway. Perfect timing for you :)

Having first rephrased it all via ChatGPT.

Load up those liabilities.

Do you mean using ChatGPT this way would also be a liability?