by throwaway290
769 days ago
As usual, license/copyright violation:

> Our process to prepare code pretraining data involves several stages. First, we collect a combination of publicly available datasets (e.g., GitHub Code Clean, Starcoder data), public code repositories, and issues from GitHub