Hacker News
by ricardobeat 210 days ago
It’s quite unlikely that the training data includes duplicate repositories or even forks; including them would by itself push the total past the published dataset sizes.