Hacker News
by Aedelon 113 days ago
Yeah, that HF dataset page is rough. 247+ threads, mostly DMCA reports, archive-locked fics scraped without consent, and the dataset reuploaded after takedown. The AO3 community had every reason to be furious.

Not RWKV-specific though. Most large corpora include the same sources; they just don't list them explicitly. Whether the transparency makes it better or worse is a real question.