by sfriedr 1073 days ago
If this is true, it would be close to an insane situation: one of the largest datasets, which the largest companies are probably using to train their models (many of the best LLMs come with technical reports that raise more questions than they answer), is being forced to live an obscure existence on torrents.

From a scientific point of view this is very problematic, because few safeguards exist to guarantee that the dataset has not been tampered with (unlike an upload to Zenodo, which provides some guarantee of immutability).

How about trying to upload the Pile to Zenodo? Only half-joking :D