
by samwillis 1132 days ago

> Microsoft GitHub is the largest collection of open source code in the world. Microsoft GitHub is in a unique and dominant position to host and access and distribute most of the open-source code in the world

No, it's not in a "unique and dominant position". Open source code is freely available online, it's almost trivial to build a bot to scrape OS code from anywhere on the web (GitHub included).

The comparison to the Google Books antitrust falls down completely, Google had a dominant position because it had the resources to scan all books. Anyone can build a collection of almost all open source code.

Further to that, all these models (GPT and Image generation) are trained on scraped data, trying to suggest that only GitHub/Microsoft could do it defeats the purpose of trying to establish what the legal rights are over training models with scraped data.

We need test cases and precedent, but trying to use this as one is not going to work.

Edit:

It took me 15 seconds to find that there is a Google BigQuery dataset of open source code from GitHub: https://cloud.google.com/blog/topics/public-datasets/github-...

and that's been further curated on Hugging Face: https://huggingface.co/datasets/codeparrot/github-code
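A minimal sketch of what using the curated copy could look like: stream the records and keep only the languages and licenses you care about. The record fields `language`, `license` and `code`, and the label strings such as `"mit"`, are assumptions taken from a reading of the dataset card rather than verified here. `keep_record`, `filter_records` and `stream_github_code` are hypothetical helper names. Whether `load_dataset` can open this dataset at all may depend on the installed version of the `datasets` package.

```python
from typing import Dict, Iterable, Iterator, Set


def keep_record(record: Dict, languages: Set[str], licenses: Set[str]) -> bool:
    """Return True if the record matches a wanted language and license."""
    # Field names are assumed; a missing field simply fails the match.
    return record.get("language") in languages and record.get("license") in licenses


def filter_records(records: Iterable[Dict], languages: Set[str],
                   licenses: Set[str]) -> Iterator[Dict]:
    """Lazily yield matching records so nothing is held in memory."""
    return (r for r in records if keep_record(r, languages, licenses))


def stream_github_code() -> Iterable[Dict]:
    """Stream the dataset instead of downloading it up front.

    Needs network access and the third-party `datasets` package.
    """
    from datasets import load_dataset  # imported lazily: optional dependency
    return load_dataset("codeparrot/github-code", streaming=True, split="train")


# Possible usage (requires network access):
#   from itertools import islice
#   wanted = filter_records(stream_github_code(), {"Python"}, {"mit"})
#   for record in islice(wanted, 3):
#       print(record["code"][:200])
```

The filter is deliberately separate from the loader, so the same logic would work on rows exported from the BigQuery dataset as long as they carry the same fields.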

GitHub / Microsoft do not have a monopoly on this data.

9 comments

rjmunro 1132 days ago

> Google had a dominant position because it had the resources to scan all books.

I thought Google had a dominant position because they signed an exclusive deal with the Authors Guild that explicitly gave them a dominant position.

Anyone else could set up a project to go round libraries and scan books. Google has put more money into it than other organisations, but The Internet Archive has about 20 million scans (https://archive.org/details/texts).

mschild 1132 days ago

There certainly are other spaces where open source code is hosted and available, but the default for most is GitHub. I think it's in a similar position to Google 10 years ago. Sure there are other search engines, but Google is by and large the standard one.

That does put Microsoft in the unique position to have direct unfettered access to any and all open source code on GitHub without restrictions. Unless you or I get the same kind of direct access without rate limiting and antibot protection, then they do dominate and have an advantage over everyone else.

reissbaker 1132 days ago

Not sure if you posted before the edits, but I'm pretty convinced by them, seeing as how there are multiple alternatives with the same data.

scarface74 1132 days ago

It’s really not that hard to

git clone

git remote set-url origin …

It’s much harder to copy Google’s index.

ChatGTP 1132 days ago

You think it's practical to do this with almost all the public repos on Github?

scott_w 1132 days ago

That's not Github's fault or Github's problem, from an antitrust perspective. If they went out of their way to make it difficult, you might have an argument but, as far as I know, they aren't. It's just practically difficult by the nature of the problem.

zanellato19 1132 days ago

They rate limit, so they do make it difficult though

scott_w 1132 days ago

They rate limit to protect their infrastructure, not to make it difficult. This is not anticompetitive.

insanitybit 1132 days ago

Yeah, I think so.

jackdaniel 1132 days ago

This is addressed in the same paragraph - you can't scan/download "whole" github because you'll be throttled.

neximo64 1132 days ago

Are you actually throttled if you try to git clone or is that what the theory is, or is the assumption that it uses API calls to scrape through github?
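On the API route, throttling is at least visible to the client, because GitHub's REST API reports the remaining quota in response headers. Below is a minimal sketch of how a polite scraper might turn those headers into a wait time. `seconds_to_wait` is a hypothetical helper, `retry-after` is assumed to be given in seconds, and none of this says anything about how plain `git clone` traffic is limited.

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long to pause before sending the next API request.

    Checks `retry-after` first, then `x-ratelimit-remaining` and
    `x-ratelimit-reset` (a Unix timestamp in seconds).
    """
    now = time.time() if now is None else now
    # Header names are case-insensitive, so normalise them first.
    h = {k.lower(): v for k, v in headers.items()}

    if "retry-after" in h:
        # Assumption: the value is a number of seconds, not an HTTP date.
        return max(0.0, float(h["retry-after"]))

    remaining = int(h.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0.0  # quota left, no need to wait

    reset_at = float(h.get("x-ratelimit-reset", now))
    return max(0.0, reset_at - now)
```

Passing `now` explicitly makes the function easy to try out without touching the network or the system clock.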

Has anyone actually tried? Because I've cloned lots of repos and have never been throttled. I'd go so far as to say the author of that post has never even tried it.

jackdaniel 1132 days ago

I'm not arguing for or against whether they are in the dominant position; what I'm doing is pointing out that the grandparent quoted part of the text (and argues against it) without quoting the justification the author provided that is directly relevant to what they say.

> There’s an important notion to address here. Open source code on GitHub might be thought of as “open and freely accessible” but it is not. It’s possible for any person to access and download one single repo from GitHub. It’s not possible for a person to download all repos from Github or a percentage of all repos, they will hit limitations and restrictions when trying to download too many repos. (Unless there’s some special archives or mechanisms I am not aware of).

logifail 1132 days ago

> Has anyone actually tried, because I've cloned lots of repos and have never been throttled

(Full disclosure: I have some pretty serious data hoarding issues)

When someone says "I've cloned lots of repos and have never been throttled" I'm afraid I immediately start wondering whether "lots" means multiple GB or multiple TB ... or more!

quickthrower2 1132 days ago

21 TB of data, they might rate limit you! But it might be possible via proxies. And only for public repos.

neximo64 1132 days ago

Copilot was only trained on public repos. I'd be surprised if you were throttled.

scott_w 1132 days ago

I'd be surprised if they didn't throttle anyone trying to download 21TB of data. And I wouldn't judge them for it.
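For a sense of scale, the 21 TB figure can be converted into raw transfer time. This is a back-of-the-envelope sketch that assumes decimal terabytes, a fully saturated link, and no throttling or protocol overhead, none of which the thread establishes. `transfer_days` is a hypothetical helper.

```python
def transfer_days(size_tb: float, link_mbit_per_s: float) -> float:
    """Days needed to move `size_tb` terabytes over a `link_mbit_per_s` link."""
    size_bits = size_tb * 1e12 * 8                  # decimal TB -> bits
    seconds = size_bits / (link_mbit_per_s * 1e6)   # Mbit/s -> bit/s
    return seconds / 86_400                         # seconds per day


for mbit in (100, 1_000, 10_000):
    print(f"{mbit:>6} Mbit/s: {transfer_days(21, mbit):.1f} days")
```

Under those assumptions the download takes about 19 days at 100 Mbit/s and about 2 days at 1 Gbit/s, so raw bandwidth is a smaller obstacle than whatever limits the host applies.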

williamcotton 1132 days ago

There’s no need to crawl for your own dataset:

https://pile.eleuther.ai/

hanselot 1132 days ago

@article{pile,
  title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
  author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
  journal={arXiv preprint arXiv:2101.00027},
  year={2020}
}

So if I understand this correctly, the Pile is for code from 2020 backwards? If I wanted anything released in the past 3 years, say something in the SOTA AI space (where a month is a lifetime), I would need the scraper again?

I don't follow how this can compare to direct, live, unrestricted access. I suppose this is just my own hatred of Microsoft shining through. Of course we should accept the status quo, because how dare we suggest Microsoft could operate in a manner that is anti-competitive.

For anyone else trying to catch up, just rent a datacenter, write a crawler, deal with all the intricacies of keeping it in sync in real-time. This sounds trivial, simple even.

I wonder why nobody is doing it? Perhaps everyone doesn't have access to petabytes of storage space, unlimited bandwidth, unlimited proxy-jumps etc.

So the alternative is to buy github?

williamcotton 1132 days ago

> I wonder why nobody is doing it? Perhaps everyone doesn't have access to petabytes of storage space, unlimited bandwidth, unlimited proxy-jumps etc.

There are multiple private companies and public institutions that are currently training LLMs.

The work required to train an LLM actually supports a fair use argument, just as it did with regard to Google scanning books.

goodpoint 1132 days ago

> No, it's not in a "unique and dominant position". Open source code is freely available online, it's almost trivial to build a bot to scrape OS code from anywhere on the web (GitHub included).

Absolutely wrong. GitHub is doing way more than just hosting code. It hosts bug trackers, CI and much more. For most FOSS projects it's the ONLY place where you can go and submit a bug report.

It's not just a repository, it's a communication tool, and it refuses to interoperate with other platforms.

This is a monopoly, just like NPM and LinkedIn. Microsoft never changes.

bilqis 1132 days ago

Github also has access to private repositories.

samwillis 1132 days ago

They don't use private repositories to train Copilot.

zelphirkalt 1132 days ago

Maybe not yet. All just a change of their terms away. Oh you don't like it? We will give you 2 weeks to migrate. Perhaps you want this other more expensive subscription?

Just like with other code they should not be using as they do, they would probably run another "ask questions later" approach.

bilqis 1132 days ago

They say they don’t

az226 1132 days ago

I’m sure you think this is a clever reply, but the reality is that GitHub wouldn’t even begin to consider whether that was technically possible. If it got out that it trained on confidential customer data, it would be game over. The risk is so stupidly large nobody in their right mind would take it. So yeah, if they say they don’t, they don’t.

account42 1132 days ago

Yet it's OK to train on copyleft code?

aleph_minus_one 1132 days ago

Copyleft code is (typically) not confidential.

unreal37 1132 days ago

I don't understand why people just automatically doubt things that companies say when they can be sued (or would otherwise destroy their business) if they are lying about it. Seems unnecessarily pessimistic.

Lio 1132 days ago

People doubt Microsoft because they've historically run a very aggressive business and done things of questionable morality many times.

They've been to court and they've lost and it definitely hasn't destroyed their business one bit.

For example, Microsoft subsidiary LinkedIn routed customer email through their servers so that they could scrape it. They did that without customer knowledge via a dark pattern.

They later apologised for doing it but still used it to propel the company's growth. In the end it didn't hurt anything but their reputation for respecting people's privacy.

Microsoft's own anti-trust history is littered with exceptional behaviour too. They are the size they are now by dint of super aggressive business practices.

phpisthebest 1132 days ago

Normally because history shows us that redress via the court system is rarely punitive to a company the size of Microsoft. Further, Microsoft has a long history of lying to its customers with seemingly no impact on its business.

yulaow 1132 days ago

I mean, we discovered that the whole car industry was lying flagrantly about its emissions tests, which had the potential of destroying the whole business, and there were A LOT of people who knew about it and could have talked at any time.

Why wouldn't software companies do the same?

phpisthebest 1132 days ago

And how many of those companies were materially impacted or had more than a couple quarters of negative consumer backlash?

None... so the grandparent's comment is without evidence that either consumers or regulators hold companies to account.

bilqis 1132 days ago

But will that actually be against the ToS or copyright? Many people tend to say that Copilot learning from OSS doesn’t infringe any copyright and is no different from a person just learning from someone else’s work. So how is it different if Copilot is learning from private repositories? Or, e.g., from leaked source code?

circuit10 1132 days ago

Isn’t it illegal to learn from leaked source code? Or even to view it at all?

nindalf 1132 days ago

I'm frequently told on HN that Big Tech would willingly, flagrantly violate GDPR like it's nothing. Even if the upside of collecting that info was minimal and the downside was 4% of global revenue.

I guess if they can do that, then what's a small lie about private repos between friends.

sshumaker 1132 days ago

I’m fairly confident this is untrue. At Microsoft at least, it’s a big deal when there is a privacy issue, even a small localized one on a single product - and creates a small firestorm.

We’ll get engineers working long hours focused on it, consulting closely with our legal and trust teams. One of the first questions we ask legal when we suspect a privacy issue is “Is this a notifiable event?”

It’s not really about getting slapped by regulators - it’s the fact that much of Microsoft’s business is built by earning the trust of large companies and small ones. Many of them are in the EU of course, but we have strict compliance we apply broadly. It’s just not worth damaging our reputation (and hurting our business) for some shortcut somewhere, as trust takes a long time to build and is easily broken.

esrauch 1132 days ago

Why would they possibly lie about that?

ChatGTP 1132 days ago

Because they do shady shit, like, by default Copilot would "sample" code for training while using it. Maybe this is no longer the default, maybe it still is, but it was the default.

This type of thing erodes trust? Why should my proprietary code be used for training by default?

I was really annoyed by this.

scrollaway 1132 days ago

OpenAI is not the same company as GitHub, and it has always been pretty clear that chats on ChatGPT are recorded and used for training (unless you now opt out).

marginalia_nu 1132 days ago

> It's almost trivial to build a bot to scrape OS code from anywhere on the web.

Seems like a logistical nightmare to me. Git repos interact spectacularly poorly with web scraping in general.

mewpmewp2 1132 days ago

I would've said you should download only archives, but really I think commits are also very important data since that shows the actual changes in the code which would be very useful to train AI to suggest changes to the code.

marginalia_nu 1132 days ago

There are valid non-evil reasons for git hosts to want to throttle and put up obstacles to scraping as well, whether via crawlers or 'git clone' or whatever. These are very expensive operations.

flockonus 1132 days ago

It appears to be the exact opposite to me: `git clone --depth 1 ...` will give you code that you know exactly how to parse, vs. webpages that have all sorts of semantic issues.
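A minimal sketch of what a scripted shallow clone might look like, using only `git clone --depth 1` through `subprocess`. The helper names `shallow_clone_cmd` and `shallow_clone` and the `owner/name` directory layout are assumptions made for illustration, and nothing here deals with throttling on the host side.

```python
import subprocess
from pathlib import Path
from typing import List
from urllib.parse import urlparse


def shallow_clone_cmd(repo_url: str, dest_root: Path) -> List[str]:
    """Build the argv for a history-free clone of one repository."""
    # "https://host/owner/name.git" -> "owner/name"
    slug = urlparse(repo_url).path.strip("/")
    if slug.endswith(".git"):
        slug = slug[: -len(".git")]
    return ["git", "clone", "--depth", "1", repo_url, str(dest_root / slug)]


def shallow_clone(repo_url: str, dest_root: Path) -> bool:
    """Run the clone; return True on success instead of raising."""
    cmd = shallow_clone_cmd(repo_url, dest_root)
    # Make sure the "owner" directory exists before git writes into it.
    Path(cmd[-1]).parent.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0
```

Building the command separately from running it keeps the network-free part easy to check. A shallow clone drops the commit history, so it would not serve anyone who wants to train on diffs.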

marginalia_nu 1132 days ago

Git clone is a very expensive operation. Git hosts generally will try to prohibit mass git cloning for this reason.

blowski 1132 days ago

What makes it so expensive? I’d always assumed it downloaded the .git directory statically, and the computational bits were done by the local client.

ablob 1132 days ago

I'd assume this is in relation to how much other operations cost. With 'git clone' you at least download the whole repository. Compare that to 'git fetch', which is essentially a lookup at the last-modified timestamp.

marginalia_nu 1132 days ago

Yeah. Git repositories can grow very large very quickly. A single clone here and there isn't too bad, but if you're scraping tens of thousands of projects, you can easily rack up terabytes in disk and network access.

moneywoes 1132 days ago

How so? Can’t someone just download the zip file and make a queue of downloads or does GitHub rate limit?
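The zip-and-queue idea can be sketched roughly as below. The archive URL pattern is the one GitHub exposes for branch snapshots. `archive_url` and `drain_queue` are hypothetical helpers, the one-second default pause is an arbitrary assumption, and whether or how GitHub limits these downloads is not something this sketch answers.

```python
import time
from collections import deque
from typing import Callable, Deque, Iterable, List, Tuple

Repo = Tuple[str, str, str]  # (owner, name, branch)


def archive_url(owner: str, name: str, branch: str) -> str:
    """URL of the zip snapshot for one branch (no commit history included)."""
    return f"https://github.com/{owner}/{name}/archive/refs/heads/{branch}.zip"


def drain_queue(repos: Iterable[Repo],
                fetch: Callable[[str], None],
                delay_s: float = 1.0,
                sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Fetch each archive in order, pausing between requests.

    `fetch` is injected so the pacing logic can be tried without network
    access; in real use it could wrap `urllib.request.urlretrieve`.
    """
    queue: Deque[Repo] = deque(repos)
    done: List[str] = []
    while queue:
        url = archive_url(*queue.popleft())
        fetch(url)
        done.append(url)
        if queue:  # no need to wait after the last item
            sleep(delay_s)
    return done
```

A snapshot like this contains only the files at one branch tip, so it is cheaper than a clone but loses the commit data discussed elsewhere in the thread.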

toastal 1132 days ago

Microsoft GitHub has access to all the commits you force pushed away and the branches you deleted. With no transparency and the source code being closed, we have no reason to believe that they're actually gone.

cassianoleal 1132 days ago

> We need test cases and president

I imagine you meant "precedent".

eternalban 1132 days ago

> The comparison to the Google Books antitrust falls down completely, Google had a dominant position because it had the resources to scan all books. Anyone can build a collection of almost all open source code.

Copying a file is not the same thing as "scanning" a book. To scan you first need to get your hands on the book (the download part) and then use industrial scanners to scan them. So the apples-to-apples comparison here is scanning <-> training and scanned collection of books <-> trained model, and finally the portals to the loot: Google Books <~> GitHub+VSC.

Not everyone has the resources to actually process -- that is, train the 'model' -- using the publicly available 'data'. Most also don't own the GitHub and VSC platforms to field their model. In fact, is anyone other than Microsoft in a position to scrape OSS, train a coding AI, and then include that tool in dominant software development platforms?