Hacker News
by VierScar 503 days ago
Wouldn't it be easier to cut off at pre-2020-ish and ask it to create the transformer architecture of GPT? 1900 is so long ago that I doubt most documents are good quality, if they've been digitised at all. Most likely they're just low-quality scanned images of inconsistent, half-illegible typewriter documents, transcribed with OCR at best.
2 comments

The problem I see with any date after the popularity of the internet is that you just can't be sure of the right date. A lot of traditional web forums now have backdated forum posts that are clearly made by an LLM with an implausible date: https://hallofdreams.org/posts/physicsforums/
You can use CommonCrawl - which has massive datasets going back to 2008 - and the Internet Archive.
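For what it's worth, here is a rough sketch of how a date check against the Internet Archive might look, using the Wayback Machine availability endpoint. The helper names (`availability_url`, `archived_before`) and the sample URL are made up for illustration, and the HTTP fetch itself is left out. Two caveats: the endpoint returns the capture closest to the requested timestamp in either direction, so a `False` result does not prove that no earlier capture exists, and a capture before the cutoff only shows the page existed then, not that a particular post was already on it.

```python
from datetime import datetime
from urllib.parse import urlencode

# Wayback Machine availability endpoint; returns JSON describing the
# capture closest to the requested timestamp.
WAYBACK_API = "https://archive.org/wayback/available"


def availability_url(page_url: str, cutoff: datetime) -> str:
    """Build the query URL asking for the capture closest to `cutoff`.

    Hypothetical helper for illustration; fetching the URL is up to the caller.
    """
    query = urlencode({"url": page_url, "timestamp": cutoff.strftime("%Y%m%d")})
    return f"{WAYBACK_API}?{query}"


def archived_before(response: dict, cutoff: datetime) -> bool:
    """Return True if the closest capture in `response` is dated at or before `cutoff`.

    `response` is the decoded JSON from the availability endpoint.
    """
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return False
    # Capture timestamps are formatted as YYYYMMDDhhmmss.
    captured = datetime.strptime(closest["timestamp"], "%Y%m%d%H%M%S")
    return captured <= cutoff
```

A page whose closest capture predates the cutoff is at least plausibly that old; to trust an individual post you would still need to fetch the snapshot and confirm the post text is in it.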
Also, there's so little training data from that era. Exponentially more data was created after, say, 1970 (or whichever year most records started being digitized).