by underlines 377 days ago
Yes, every major LLM company did it:

illegally using Anna's Archive, The Pile, Common Crawl, their own crawls, Books2, LibGen, etc., embedding it all into a high-dimensional space, and doing next-token prediction on it.