by humanistbot 1044 days ago
Everything in archives pre-2021 is still untainted. All major social media, Q&A, code repos, and archive.org are timestamped. It taints future collection of training data, but not existing collection of training data.

What's the plan then, to coast on pre-2021 data forever? How much utility would today's LLMs have if they were trained on fossilized archives of the internet from 10, 15, or 20 years ago?