| HN Mirror



	by inscrutable 1133 days ago
	My sweet summer child, this is a closely guarded secret. Will only be revealed if perhaps Europe demands it so that copyright holders can sue.


gabereiser 1133 days ago

Metadata will show where it came from, should you choose to keep it. Or so they showed on the big screen at I/O today.


inscrutable 1133 days ago

Maybe you're right, but I'd be skeptical. In a non-snarky way, this shows the data sources used in models to date, up to GPT-3.

https://lifearchitect.ai/whats-in-my-ai/

OpenAI paid $2m/year for Twitter feeds until Elon cut them off, Sam Altman has mentioned they'd paid a lot for scientific journals, and Reddit has mentioned they'll start charging. Given how central data quality and curation are, if these private data sources give a significant boost, it won't be available for Apache2 models.


sebzim4500 1133 days ago

Given Reddit's inability to keep their website functioning (unless you use the far superior old.reddit.com) I find it hard to believe they would be able to stop a motivated developer from scraping the whole site.


dontupvoteme 1133 days ago

this is about the time that i expect sites to begin returning intentionally corrupt/incorrect/perhaps outright garbage (subtle or not, probably better subtle so they don't realize it until it's far too late) data in order to intentionally poison enemy wellscraping. where "ethics" dissolve into the inherent raw cannibalistic laws of capitalist ventures.

then you can sell them back the TBs they scraped at a 1000x markup for the real data. or attempt to watermark it so you can prove their illegal(?) usage of your services in their training.


KeplerBoy 1133 days ago

You might be right. What a dystopian future that will be. Make a few requests too many and the webserver might think you're scraping data so it gaslights you into reading bullshit.


dmix 1133 days ago

Is this sarcasm? I can’t tell.


sebzim4500 1133 days ago

Maybe they've been doing that for years and that's why all the advice subreddits turned into creative writing subreddits.
