|
|
|
|
|
by NitpickLawyer
150 days ago
|
|
I train coding models with RLVR because that's what works. There's ~0.000x good signal in mailing lists that isn't in old mailing lists. (and, since I can't reply to the other person, I mean old as in established, it is in no way a dig to lwn). You seem to be missing my point. There is 0 incentives for AI training companies to behave like this. All that data is already in the common crawls that every lab uses. This is likely from other sources. Yet they always blame big bad AI... |
|
some scrapers might skip out on already-scraped sources, but easy to imagine that some/many just would not bother (you don't know if it's updated until you've checked, after all). And to some extend you do have to re-scrape, if just to find links to the new stuff.