HN Mirror



	by throwaway290 769 days ago
	As usual, license/copyright violation:
	> Our process to prepare code pretraining data involves several stages. First, we collect a combination of publicly available datasets (e.g., GitHub Code Clean, Starcoder data), public code repositories, and issues from GitHub