Hacker News new | ask | show | jobs
by inscrutable 1133 days ago
My sweet summer child, this is a closely guarded secret. Will only be revealed if perhaps Europe demands it so that copyright holders can sue.
1 comments

Metadata will show where it came from, should you choose to keep it. Or so they showed on the big screen at I/O today.
maybe you're right, but I'd be skeptical. In a non-snarky way, this shows the data sources used in models to date up to GPT 3.

https://lifearchitect.ai/whats-in-my-ai/

OpenAI paid $2m/year for twitter feeds until Elon cut them off, and Sam Altman has mentioned they'd paid a lot for scientific journals and Reddit mention they'll start charging. Given how central data quality and curation is, if these private data sources give a significant boost, it won't be available for Apache2 models.

Given Reddit's inability to keep their website functioning (unless you use the far superior old.reddit.com) I find it hard to believe they would be able to stop a motivated developer from scraping the whole site.
this is about the time that i expect sites to begin returning intentionally corrupt/incorrect/perhaps outright garbage (subtle or not, probably better subtle so they don't realize it until it's far too late) data in order to intentionally poison enemy wellscraping. where "ethics" dissolve into the inherent raw cannibalistic laws of capitalist ventures.

then you can sell them back the TBs they scraped at a 1000x markup for the real data. or attempt to watermark it so you can prove their illegal(?) usage of your services in their training.

You might be right. What a dystopian future that will be. Make a few requests too many and the webserver might think you're scraping data so it gaslights you into reading bullshit.
Is this sarcasm? I can’t tell.
Maybe they've been doing that for years and that's why all the advice subreddits turned into creative writing subreddits.