| > Microsoft GitHub is the largest collection of open source code in the world. Microsoft GitHub is in a unique and dominant position to host and access and distribute most of the open-source code in the world. No, it's not in a "unique and dominant position". Open source code is freely available online, and it's almost trivial to build a bot to scrape open source code from anywhere on the web (GitHub included). The comparison to the Google Books antitrust case falls down completely: Google had a dominant position because it had the resources to scan all books, whereas anyone can build a collection of almost all open source code. Further to that, all these models (GPT and image generation) are trained on scraped data. Suggesting that only GitHub/Microsoft could do it defeats the purpose of trying to establish what the legal rights are over training models with scraped data. We need test cases and precedent, but trying to use this as one is not going to work. Edit: It took me 15 seconds to find that there is a Google BigQuery dataset of open source code from GitHub: https://cloud.google.com/blog/topics/public-datasets/github-... and that's been further curated on Hugging Face: https://huggingface.co/datasets/codeparrot/github-code GitHub / Microsoft do not have a monopoly on this data. |
I thought Google had a dominant position because they signed an exclusive deal with the Authors Guild that explicitly gave them one.
Anyone else could set up a project to go round libraries and scan books. Google has put more money into it than other organisations, but the Internet Archive has about 20 million scans (https://archive.org/details/texts).