Hacker News
by throwaway290 769 days ago
As usual, license/copyright violation:

> Our process to prepare code pretraining data involves several stages. First, we collect a combination of publicly available datasets (e.g., GitHub Code Clean, Starcoder data), public code repositories, and issues from GitHub