Not a lawyer but would assume downloading material from libgen is, in the vast majority of cases, illegal because it's a breach of copyright or similar. That’s gotten Meta in quite a spectacle of late [1]
CommonCrawl is composed of copyrighted contents too. You gain copyright on your work automatically the moment you created it, including this very comment.
One could argue that using copyrighted content in LLMs, much like reposting, should fall under fair use. This is also Microsoft's claim in the GitHub Copilot lawsuits. It's up to the court to decide though. (IANAL)
It’s a catchy term, but loaded. Copyright protects only original expression, not ideas and information. So if a computer algorithm reads the former and outputs the latter, arguably copyright isn’t involved at all.
There are plenty of good counterarguments to this as well, when you consider the effects of automation and scale. I’m definitely interested in seeing how the jurisprudence develops as these cases go through the courts.