The copyright situation around all this is very... interesting. It's pretty clear that this dataset is not legal, but what about the resulting models? What if the texts actually were bought 'properly'?
The race is on to figure out a way to get LLMs to produce content that can be used to train other LLMs in a satisfactory way. Eventually the dataset question will get figured out in the courts, but if there’s a technique to generate more training data in an automated way, then the court decision doesn’t matter.
Edit: also, I don’t believe court decisions can be enforced retroactively, so existing LLMs would be safe, but I’m most definitely not a lawyer.