Are Package Registries Holding Open-Source Hostage? | HN Mirror


	Are Package Registries Holding Open-Source Hostage? (about.scarf.sh)
	91 points by aviaviavi 2053 days ago

22 comments

jzb 2052 days ago

FTA: "Ultimately, package registries need to align their incentives with those of maintainers."

Putting it all on the registries to come up with a viable business model and provide this to maintainers without any responsibility[1] on the part of the maintainer seems really one-sided.

It costs quite a bit of money to run something like Docker Hub or NPM. If you want something aligned with software maintainers first and foremost, you want a non-profit / foundation that's got priorities aligned with the larger community and not a for-profit entity that has to justify keeping the lights on.

Kinda silly headline, too. There are many package registries, but we only see two here that have business models interfering with distribution of software. Only one that's really impeding the ability to host software elsewhere if you don't like their business model.

Docker Hub's rate limits seem unlikely to impact most usage of Docker, and people who're pulling 200 images every six hours should either seek to set up their own registry to take the load off Docker Hub or throw some money to help shoulder the costs. Even if the user's only grabbing Alpine images at 5MB per image, 200 in six hours starts to add up!

[1] Granted maintainers may do a lot of work in actually maintaining the software.

emilsedgh 2052 days ago

> It costs quite a bit of money to run something like Docker Hub or NPM. If you want something aligned with software maintainers first and foremost, you want a non-profit / foundation that's got priorities aligned with the larger community and not a for-profit entity that has to justify keeping the lights on.

This is an interesting thought. Linux Foundation has good corporate backing. FSF traditionally provided the backbone in terms of compilers and userland basics. Maybe their 21st century task (and what keeps them relevant in this age) should be such infrastructure.

Apache Foundation is also an interesting candidate for this. I think they also had a good corporate backing.

realquadrant 2052 days ago

What do you see as pros and cons of having the Linux Foundation running it and what are the core features?

mcguire 2052 days ago

"Docker Hub's rate limits seem unlikely to impact most usage of Docker, and people who're pulling 200 images every six hours should either seek to set up their own registry to take the load off Docker Hub or throw some money to help shoulder the costs. Even if the user's only grabbing Alpine images at 5MB per image, 200 in six hours starts to add up!"

Maybe there's something technically wrong with the Docker model?

sp332 2052 days ago

I was amazed anyone tried to make a free Docker registry. It's like making a CDN, except instead of individual files, it's for a whole app and all of its dependencies. It's a crazy amount of data for storage and bandwidth.

twunde 2052 days ago

It's more akin to making and then running a CDN but only charging 10% of customers. I'm sure Docker the company was writing it off as a marketing expense, but if they were running this in one of the public clouds that charge for egress, they were paying out a boatload of money just in egress charges (plus more in storage costs)

wmf 2052 days ago

> making and then running a CDN but only charging 10% of customers

Like CloudFlare? Freemium can be a very successful business model.

edoceo 2052 days ago

can - which leaves lots of room for doesn't

ris 2052 days ago

Well, of course you've got to do that to get people to sign up to the docker model and become dependent on registries in the first place.

jrochkind1 2052 days ago

Yeah. What if someone had said "Wait, is this sustainable?" before building up a docker-based solution? (I wonder if we can find any archived discussions wondering that, or if we're just so used to thinking "things can scale for free indefinitely on the internet" that nobody wondered?)

Now that they have it, the cost they are willing to pay is based in part on cost-of-switching. Which is pretty enormous, directly and indirectly, when entire ecosystems based on docker have been iterated.

tshaddox 2052 days ago

I wouldn’t call the need for caching to ensure good performance “something technically wrong.”

stubish 2052 days ago

A properly designed registry doesn't have to cost much to run. The costs are incurred when people use the registry for distribution, or for all the infrastructure needed to monetize the registry and track users. All a registry needs to provide is an index of URLs, signing keys, and some useful metadata to enable discovery. Trivial in its purest form, with more cost if you want advanced features like ratings or curation or your own authentication service.
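
A rough sketch of what such an index-only registry could look like, in Python. Every name here (`PackageRecord`, `RegistryIndex`, the example URL and fingerprint) is hypothetical and not taken from any real registry; the point is only that the registry stores pointers and metadata while the publisher hosts the bytes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PackageRecord:
    """One index entry: where the artifact lives, plus discovery metadata."""
    name: str
    version: str
    url: str                      # hosted by the publisher, not the registry
    sha256: str                   # hex digest of the artifact
    signing_key_fingerprint: str  # identifies the publisher's signing key
    keywords: tuple = ()


@dataclass
class RegistryIndex:
    records: dict = field(default_factory=dict)

    def publish(self, record: PackageRecord) -> None:
        # Index only: the artifact bytes never pass through the registry.
        self.records[(record.name, record.version)] = record

    def resolve(self, name: str, version: str) -> PackageRecord:
        return self.records[(name, version)]

    def search(self, keyword: str) -> list:
        # Naive discovery: match on package name or on an exact keyword.
        kw = keyword.lower()
        return sorted(
            (r for r in self.records.values()
             if kw in r.name.lower() or kw in (k.lower() for k in r.keywords)),
            key=lambda r: (r.name, r.version),
        )


index = RegistryIndex()
index.publish(PackageRecord(
    name="examplepkg", version="1.0.0",
    url="https://example.com/examplepkg-1.0.0.tar.gz",  # placeholder URL
    sha256="0" * 64,                                     # placeholder digest
    signing_key_fingerprint="HYPOTHETICAL-FINGERPRINT",
    keywords=("demo",),
))
print(index.resolve("examplepkg", "1.0.0").url)
```

Ratings, curation, or an authentication service would be extra layers on top of this, which is where the additional cost would come in.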

autarch 2052 days ago

This article ignores the fact that for many languages, the package repository is maintained as free software by volunteers (sometimes with funding from a foundation). This includes Perl, Python, Ruby, Rust, and many others.

NPM is the odd one out, really. I don't think letting one company control a language ecosystem's single package registry is a great idea, for all the reasons that the author notes!

takluyver 2052 days ago

I agree, though it's worth noting that while volunteers can maintain the software and administer the indexes, they also rely on infrastructure provided by big corporations. E.g. the Python Package Index runs on a CDN provided by Fastly, which serves hundreds of TB per day. I very much doubt the non-profit Python Software Foundation could afford that bandwidth if it wasn't an in-kind donation.

luhn 2052 days ago

It's a couple petabytes, Michael. What could it cost, $10?

Seriously though, Fastly's donation of their CDN service is generous and eases the burden on the PSF, but if push came to shove they could definitely afford the bandwidth. In 2018 they had a net income of half a million.

di 2052 days ago

Hi, PyPI maintainer and PSF director here.

There's absolutely no way the PSF could afford PyPI's bandwidth out of pocket. Last I checked our "bill" from Fastly would be close to $1.5M/month.

Also given that PyPI is critical infrastructure for millions of people and software projects, anything cheaper would not really cut it.

tedivm 2052 days ago

As of March PyPI was pushing out 300TB a day through its CDN. Ignoring the "off the shelf price" of $0.12/GB and assuming they negotiate a bulk discount driving them down to $0.05/GB, that's still $15,000 a day (or just shy of $5.5 million a year). Their net income in 2018 would cover less than 10% of that bill.
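
The arithmetic, as a quick Python sketch. The 300TB/day and $0.05/GB figures are the ones above; the $500,000 net income is the "half a million" claimed upthread, and 1 TB = 1000 GB is an assumption. The price is kept in whole cents to avoid floating-point noise.

```python
TB_PER_DAY = 300              # daily CDN traffic quoted above
GB_PER_TB = 1000              # assumption: decimal units
PRICE_CENTS_PER_GB = 5        # the assumed negotiated rate of $0.05/GB
NET_INCOME_2018 = 500_000     # "half a million", as claimed upthread

daily_cost = TB_PER_DAY * GB_PER_TB * PRICE_CENTS_PER_GB // 100  # USD
annual_cost = daily_cost * 365                                   # USD
coverage = NET_INCOME_2018 / annual_cost

print(f"daily:  ${daily_cost:,}")
print(f"annual: ${annual_cost:,}")
print(f"2018 net income covers {coverage:.1%} of one year")
```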

true_religion 2052 days ago

Well.... they couldn't afford Fastly, or Akamai, or Brightcove, but they could afford any tier-2 or tier-3 CDN.

For example, CDN77 would start you off at $0.016/GB if you have more than 100TB per month, without any negotiation.

If you have 300TB per day, you surely can negotiate sub-cent pricing somewhere.

No, the costs are manageable.

I think the charitable/community projects set up for developers need to act like non-profits in other sectors and actively seek out donations.

takluyver 2052 days ago

I assume there's some difference in e.g. speed or reliability between Fastly (8 c/GB) and CDN77 (1.6 c/GB)? Even if you can negotiate it to 0.8 c/GB, you're still talking about a huge piece of the PSF's budget. They would have to either cut other expenditure (e.g. making grants) dramatically, or find a lot of new income (continuously, not just a one-off donation drive). And PyPI's bandwidth is growing rapidly [1].

If PyPI didn't have a sponsor providing bandwidth, I'd guess it would implement some form of rate limiting and encourage people to mirror/cache packages much more to reduce load. I don't think it would die completely, but it would be less convenient and still cost the PSF a fair bit of money.

[1] https://twitter.com/di_codes/status/1235707819955032069

takluyver 2052 days ago

The PSF's tax returns are published on python.org, and 'revenue less expenses' for 2018 was just under $280k. IANA accountant, so maybe that's the wrong line to look at.

webology 2051 days ago

Hi, I'm a PSF Director and the PSF's Treasurer as of this year. For transparency, our tax returns as of 2018 (and soon 2019) are up on https://www.python.org/psf/records/

You are right that this donation does not show up in our tax filings because they provide it to any OSS project.

autarch 2051 days ago

Also, it's not income that you would put in your 990s. Similarly, the donor cannot deduct the expense of the donation.

autarch 2052 days ago

Yes, that's a good point. The Perl repo and services are also relying on donations from various companies. That said, it's easier to switch donors than it is to switch repositories.

ForHackernews 2052 days ago

If necessary, it seems like it'd be easy for any of these package registries to ~~blackmail~~ encourage big companies into donating infrastructure.

"If you don't support us, we might accidentally forget to audit our packages and feed malware into your build pipelines. It'd truly be a shame..."

fxtentacle 2052 days ago

I just used my own docker repo right from the start, precisely to make sure it'll be on a domain that I control.

In my opinion, every open source project using a foreign-owned domain as their main distribution method is just naive. Of course you'll never be able to keep things constant, because you never had any power over that domain.

jka 2052 days ago

This is an important conversation and I can't help but think that it is repeatedly drawn in predictably valley-minded directions.

As a responsible developer your use case is likely: "I want to install package X, and know that my customers are receiving and installing that package when they perform a build"

Individuals' ability to self-host content has been held back by the limitations of DNS (content addressing) and IPv4 (routability) for a long time now.

What I ask is this: if you as a developer were able to self-host the libraries and applications you offer to others -- regardless of whether they're open source or proprietary -- would that not solve most of these perverse incentive situations?

- The bandwidth costs would be yours, but that would allow you to find a charging model that works

- If your software became extremely successful - beyond your own ability to pay for the bandwidth - then the companies and individuals who rely upon your software would be incentivized to step in to foot the bill

- Data about software adoption and usage would be de-rigueur shared with the providers who offer the bandwidth for it

- We would not be reliant (in a community sense, but also in a day-to-day operations sense) on the benevolent albeit often-loss-leading hosting of centralized repositories

edoceo 2052 days ago

How can we not self-host? For lots of these you can if you want. I mean, I host docker images and various packages that folks source from our infrastructure. I don't see DNS or IPv4 as a blocker

im3w1l 2052 days ago

A 13 year old dabbler cannot self host using their home connection. And that is arguably a problem.

Hell I don't know how I would self-host, I just (reluctantly) put stuff on paid-for servers. I guess you start by calling your ISP and asking if they can pretty please give you a static IPV4?

edoceo 2052 days ago

Ok, I get it. I host on a cloud VPS which costs $$/mo - I don't consider hosting on home system a blocker.

Even still, you don't need a static. Easy to do DNS map and punch a hole in your FW or have some DMZ.

There is not a way to solve for zero-cash and zero-work

jka 2051 days ago

You got it, yep; I should've been more precise about what I meant by self-hosting.

Smartphones and data plans would provide the bare necessities for self-hosting; in many cases there's abundant computing and bandwidth resource available to them. Many packages/containers are small and downloaded infrequently, and much data capacity, I expect, goes unused.

If and when smartphone-hosted resources become insufficient, a cloud VPS like your approach would be a next logical upgrade. Beyond that, corporate/foundation-based sponsorship and dedicated servers and bandwidth.

The key would be to make it near-seamless to migrate between those different environments. Namespaced source code repositories and packages appear to have worked well for the likes of GitHub, GitLab, NPM and Docker Hub, so perhaps following similar conceptual design ideas would make sense.

A few areas of concern would be:

- How do you keep end-user devices safe if they will be hosting content for a wide audience?

- How do you react to credentials and other protected content being posted if repositories themselves are a distributed network?

- How do you achieve discoverability and search of content in a distributed environment?

... not to mention whether the effort and migration to such a model is worth the benefits.

I tend to think it would be, since it aligns the incentives around spending, increases hardware utilization, and increases resilience by removing single points of failure.

That said, I also imagine there are well-founded and sincere arguments for continued centralized code and container hosting that are valid and worthwhile (not least of which: it's where we are, and it's relatively straightforward to reason about).

jka 2051 days ago

tl;dr - GitTorrent[1][2], and more recently, radicle[3]

[1] - https://blog.printf.net/articles/2015/05/29/announcing-gitto...

[2] - https://hn.algolia.com/?q=GitTorrent

[3] - https://radicle.xyz/

im3w1l 2052 days ago

From what I heard, hosting on dynamic IP is unreliable.

jrockway 2052 days ago

The problem with Docker Hub is that they don't let you choose the pricing model. In the end, bandwidth and CPUs are not free, so someone has to pay. For a while it was VCs hoping for growth, but we all know that giving stuff away for free is not sustainable forever.

The problem with Docker Hub's pricing model is there are actually two use cases for Docker Hub. One is where some random entity makes some software -- they don't have any money, so it makes sense for the user to pay to download it. It's cheaper than setting up your own CI and hosting to build images, after all. That's the only pricing model that Docker Hub supports right now. It may be annoying that it used to be free, but that was just an accident -- you should have been paying for Docker Hub pulls from day one.

The other model is where some commercial entity wants to distribute software to their users. In that case, they'd be happy to pay for their users to anonymously download it. But, Docker Hub doesn't support this particular model, and that's what's causing a lot of problems. (Where I work, this is already a problem for our customers. I think we're going to move to GCR, and like the article mentions, this is a pain because we are locked in -- you can't make Docker Hub 302 to your new host. Everyone has to rewrite their configuration to pick up the new registry, all because Docker won't let us pay for their pulls!)

The article also talks about NPM being problematic. I kind of agree with this; I spend a lot of time waiting for my CI system to re-pull node modules. (Because everything is done in containers, nothing is preserved between builds. Things can be preserved if you want them to be, but on CircleCI it's actually faster to re-pull all the modules from the Internet than to restore from their own caching mechanism. Shrug.) I had this problem with Go when modules first came out (our builds were getting ratelimited by Github because a lot of our dependencies happened to be hosted there), but it was easy enough to set up a local module proxy, and it was never a problem for us again. (Go has since decided to run their own module proxy, and I stopped running my own. Theirs is great! But obviously, someday it won't be free anymore, but at least running your own is a first-class option.) NPM, it seems, have never really offered this as an option -- they've never rate limited me, and their documentation says "don't worry about it!" which of course makes me worry more ;)

TuringNYC 2052 days ago

>> The other model is where some commercial entity wants to distribute software to their users. In that case, they'd be happy to pay for their users to anonymously download it.

This has hit us big-time last week -- all our k8s components are hosted on DockerHub and our firm is happy to do publisher-paid pulls, but it seems to not be an option. Instead, every customer is expected to get a paid account (if they pass some download threshold, which they may or may not.)

What are others doing? We're thinking of moving to Azure Container Registry or Amazon Elastic Container Registry. We're happy to pay just to ensure customers are not randomly throttled, though I wish we could just keep it all in DockerHub (we'd be fine paying DockerHub for customer usage, but we can't force customers to do so themselves.)

ArchOversight 2052 days ago

This comment may provide some hope: https://news.ycombinator.com/item?id=25061202

justincormack 2052 days ago

There are publisher pays plans available for Docker Hub, email pricingquestions@docker.com for details.

jrockway 2052 days ago

That's great to hear. I will be reaching out :)

zajio1am 2052 days ago

This is a common pattern for breaking the pricing mechanism and preventing competition: separate the choosers of the service from the payers of the service. In this case free hosting is offered to publishers, they choose to host their images there, and then users who download them are asked to pay. Had publishers been asked to pay, they would select an appropriate registry and perhaps self-host one rather than use a third-party one.

ChrisMarshallNY 2052 days ago

I use a number of dependencies.

Most, I wrote, myself, maintain on GitHub, and include as Swift Package Manager dependencies. I write in a modular, layered fashion, and try to fork off as many components as possible into standalone projects.

I'm very, very careful about including third-party dependencies. I think that these are the only ones that I use, throughout my projects:

    SOAPEngine (Paid)
    ffmpeg
    VLCLib
    SwiftKeychain

The first, I downloaded and installed directly into my repo (no live link), the two video libs, I use Carthage to include from their home repos, and the last, SPM (also from the home repo). No real registries. I am not a fan of CocoaPods. I use Homebrew for some dev utils on my computer, but the above list is what ships.

I may have one more, somewhere, but I can't remember, and I'm too lazy to look. We can rest assured that it was not lightly added.

LockAndLol 2052 days ago

We have IPFS and the code for hosting most registries is open-source. If the opensource community really wanted to / got annoyed enough, it would devise a system that used those components to make a distributed package registry.

It's easy to complain, it's more difficult to work on solutions. We should all be doing more of the latter (working on solutions).

Qwertious 2052 days ago

Working on solutions does nothing if you're not working on the right solution. There's nothing quite as useful as a really precise complaint.

LockAndLol 2051 days ago

I disagree. A precise bug report is good, a "precise complaint" is a mere opinion. It feels really nice to write one and people like patting themselves on their back, but opinions are like assholes, everybody has one.

Personally, I'd much rather see a solution to a problem than a complaint. The solution might involve discussions that go back and forth, but if they culminate in a decision on a way forward with a person willing to do the work, they are much more useful than "your code sucks on line".

Complaining isn't contributing.

Edit: also, working on a solution doesn't preclude discussion on the proper solution.

thrower123 2052 days ago

Relying on generosity when there are infrastructure costs does not seem to be a workable model. It's somewhat amazing that it has survived so long.

vageli 2052 days ago

> Relying on generosity when there are infrastructure costs does not seem to be a workable model. It's somewhat amazing that it has survived so long.

I too was initially surprised but then I thought of github (this was also the case pre-microsoft acquisition), sourceforge, gitlab, etc who all seem fine with me downloading/cloning as many repos as I want without charge.

908B64B197 2052 days ago

Question: What keeps an organization from hosting its own package mirror internally and only periodically fetching the diffs from the central registry?

hinkley 2052 days ago

There are tools for doing this, but it's a matter of cost and complexity to deal with them.

Artifactory seems to have a pretty big chunk of this vertical. It supports a few different repository protocols, so it serves as a bit of a one-stop shop that survives technology changes.

908B64B197 2052 days ago

If you are fetching multiple GB of images over the network it kinda makes sense.

Someone 2052 days ago

Way more than “kinda”. If you have a continuous integration pipeline that checks out projects from scratch (as it should), every build fetches all dependencies, transitively.

Even ignoring download costs, a local cache (one of the functions of an artifactory) helps speed up those downloads and, with it, your builds. It probably also helps against getting blacklisted by code repositories.

An artifactory also automatically backs up any libraries you use. That protects against them disappearing from the internet.

hinkley 2052 days ago

I think the first wave of artifactory customers was also populated by companies with limited network connectivity. It’s nuts to run a Rails or J2EE project if your company is using a pair of 1MB modems for all traffic, even if the dependencies are relatively small. Branch offices are similarly hamstrung. That was part of Perforce’s customer base as well, since they could run a local proxy for source code.

As you get into CI/CD you start to notice that your upstream repo is occasionally down, because it’s getting in the way of some deadline.

neves 2052 days ago

We use Nexus and cache all of our packages, but it is one more system to maintain and update. Sure Nexus is a great asset, almost never gave us trouble.

robert_dipaolo 2052 days ago

AWS has a managed service CodeArtifact that supports all the common code package repos and allows caching of upstream repos. Granted it doesn't work with Docker Images, but you asked about packages.

TuringNYC 2052 days ago

Someone would have to support it 24x7 and we could never get the uptime of DockerHub/ACS/ECS. Since a Production k8s deployment could spin up an instance at any time of day, some type of 5-9 or at least 4-9 uptime is pretty important.

908B64B197 2052 days ago

I see.

I guess you could still fall back to the main package source if the local mirror is down.
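
A minimal sketch of that fallback in Python, assuming an ordered list of sources and a pluggable `fetch` callable. The function name, the hostnames, and the stand-in `fake_fetch` transport are all hypothetical; a real client would plug in its own download function.

```python
from typing import Callable, Iterable


def fetch_with_fallback(
    package: str,
    sources: Iterable[str],
    fetch: Callable[[str, str], bytes],
) -> tuple:
    """Try each source in order; return (source, payload) from the first that answers."""
    errors = []
    for source in sources:
        try:
            return source, fetch(source, package)
        except OSError as exc:  # covers refused connections, timeouts, DNS failures
            errors.append(f"{source}: {exc}")
    raise RuntimeError("all sources failed: " + "; ".join(errors))


# Stand-in transport so the sketch runs without a network.
def fake_fetch(source: str, package: str) -> bytes:
    if source == "https://mirror.internal.example":
        raise ConnectionError("mirror is down")
    return f"{package} from {source}".encode()


used, payload = fetch_with_fallback(
    "examplepkg",
    ["https://mirror.internal.example", "https://registry.example.org"],
    fake_fetch,
)
print(used)
```

The catch is that the fallback only helps if the upstream has not throttled you, so it covers mirror outages rather than rate limits.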

mschuster91 2052 days ago

As long as you're not doing push stuff:

1) set up a series of N docker registry mirrors in pull-through mode (https://docs.docker.com/registry/recipes/mirror/, it's as simple as "docker run --rm --name registry -d -p 5000:5000 -e REGISTRY_STORAGE_DELETE_ENABLED=true -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io -v /mnt/persistentdata/registry:/var/lib/registry registry")

2) expose them on the same domain name (multiple A records, loadbalancers, whatever you want)

3) set them as mirrors in each machine's docker daemon

In case one of your mirrors goes down, take it out of the DNS/LB rotation. That's it.
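
For step 3, the daemon reads its mirror list from the `registry-mirrors` key of its `daemon.json` config. A small Python sketch that merges a mirror into an existing config follows; the helper name and the mirror hostname are made up for illustration, and the `https` URL assumes TLS is terminated in front of the registry containers.

```python
import json


def with_registry_mirrors(existing: dict, mirrors: list) -> dict:
    """Return a copy of a daemon.json config with the mirrors added, without duplicates."""
    merged = dict(existing)
    current = list(merged.get("registry-mirrors", []))
    for url in mirrors:
        if url not in current:
            current.append(url)
    merged["registry-mirrors"] = current
    return merged


config = with_registry_mirrors(
    {"log-driver": "json-file"},                      # whatever is already configured
    ["https://docker-mirror.internal.example:5000"],  # hypothetical shared mirror name
)
print(json.dumps(config, indent=2))
```

The result would be written to the daemon's config file (usually `/etc/docker/daemon.json` on Linux) and the daemon restarted. As far as I know this only affects pulls of Docker Hub images, not other registries.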

wmf 2052 days ago

Why is no one talking about this solution?

ArchOversight 2052 days ago

Cause users still need to update where they are pulling from.

There'd need to be a way within docker to alias to the new URL so that what normally would go to docker hub ends up pulling from the mirror.

vorpalhex 2052 days ago

Nothing. We did this when NPM was having issues and it worked very well for us, we also did this for some non-US team mates who had very poor NPM performance.

It runs well, is easy to keep up and working and generally was awesome.

TazeTSchnitzel 2052 days ago

Maybe a content-addressable P2P web where everyone has their own cache would be better as a way of distributing packages. It would spare a lot of bandwidth costs for the hosts and maybe make us less dependent on big corporate benefactors.

eins1234 2052 days ago

Definitely one use case for IPFS I'm personally super excited about.

One objection that I'm sure will come up is "what if people stop hosting a package you depend on":

That's where dedicated package hosting services (like npm) can come in and provide a reliable source of these packages that's fast and always available (potentially for a price). The benefit over the status quo is those services will be commoditized, so they have to compete purely on price/reliability, since as a user you don't have to care where the packages come from in a content-addressed system.
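
A toy Python illustration of why the source stops mattering. It uses a plain SHA-256 hex digest as the address, which is a simplification and not IPFS's actual identifier format, and plain dicts stand in for peers and hosts; all names are hypothetical.

```python
import hashlib


def address_of(content: bytes) -> str:
    """The address of a package is just the hash of its bytes."""
    return hashlib.sha256(content).hexdigest()


def fetch_by_address(address: str, providers: list) -> bytes:
    """Ask providers in turn; accept the first payload whose hash matches."""
    for provider in providers:
        content = provider.get(address)  # dicts stand in for peers/hosts
        if content is not None and address_of(content) == address:
            return content
    raise LookupError(f"no provider returned valid content for {address}")


package = b"print('hello from examplepkg')\n"
addr = address_of(package)

flaky_peer = {}                      # no longer hosts the package
tampering_peer = {addr: b"malware"}  # serves the wrong bytes
paid_host = {addr: package}          # a commercial host that keeps it available

assert fetch_by_address(addr, [flaky_peer, tampering_peer, paid_host]) == package
```

Because the client verifies the hash itself, a paid host is interchangeable with any other provider that has the same bytes.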

RabbitmqGuy 2052 days ago

The Golang package dependency story has been bumpy, but one thing they got absolutely right is not having a package registry.

Anyone can host their golang package at example.com/myPkg and Go wouldn't care less what/who runs example.com

diegof79 2052 days ago

I don’t know the specifics about Docker, but at least for NPM the article got some of the things wrong:

> ..npm, and other comparable registries are incentivized to create lock-in

The author’s conclusion doesn’t match the actions of NPM. There are open source implementations of both client and server, you can run your own NPM registry, and you can change the default registry easily.

> What makes npm’s particular scenario even worse is that they've made it so difficult to use a registry that is not npm

Uh?! I’m using Verdaccio (an open source npm registry) every day, the setup is extremely easy. Also yarn uses their own registry by default.

I don’t see where is the “lock in” in NPM.

Edit: typo

phkahler 2052 days ago

What about self hosting? Anything but the most popular projects should be possible to host on a Raspberry Pi with static web pages and torrents for large downloads. Off your own home network...

jka 2052 days ago

For anyone interested in opting out of scarf's analytics when installing NPM packages in any of your environments:

https://github.com/scarf-sh/scarf-js#as-a-user-of-a-package-...

houdinifxtd 2051 days ago

It would be wonderful to keep the service free, but it's also a fair point that the data usage must be tremendous.

Perhaps a balance could be achieved by rate limiting by IP to encourage caching by devs. I'm guessing a fair amount of waste occurs from CI testing. Normally once an end user has the image that should be enough.
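
Something like the following sliding-window limiter, sketched in Python. The 200-pulls-per-six-hours figure is the one quoted elsewhere in this thread; the class name and the example IP are hypothetical, and a real service would need shared state across servers rather than an in-memory dict.

```python
from collections import defaultdict, deque


class PerIPRateLimiter:
    """Sliding window: at most `limit` pulls per `window_seconds` for each IP."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.pulls = defaultdict(deque)  # ip -> timestamps of recent pulls

    def allow(self, ip: str, now: float) -> bool:
        recent = self.pulls[ip]
        # Drop pulls that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True


limiter = PerIPRateLimiter(limit=200, window_seconds=6 * 3600)
# One pull per second from the same address: the 201st is refused.
results = [limiter.allow("203.0.113.7", now=float(i)) for i in range(201)]
print(results.count(True), results[-1])
```

A CI fleet behind one NAT address would hit this quickly, which is exactly the nudge toward caching.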

choeger 2052 days ago

It works well with youtube. Why not do the same for software? Advertisement is obviously not an option but one could easily ask companies to pay a few k$ a year for professional access. Then just take 30% or so as the platform and redistribute the rest among the uploaders.

Animats 2052 days ago

Yes, finding a place where you can cut off someone's air supply is a good business model.

a1369209993 2052 days ago

More to the point, an even better business model is to make such a place, manipulate people into it, then pretend you're just charging fair market price for air.

based2 2052 days ago

And why not a model where insecure packages are no longer accessible?

eeZah7Ux 2052 days ago

And that's what you get for using corporate-driven open source.

ananonymoususer 2052 days ago

Of course you COULD just pay them a reasonable fee for their service.

remram 2052 days ago

The users could, but what can you do as a maintainer if they don't?

api 2052 days ago

Nah, FOSS in 2020 is "give me free stuff slave!" while meanwhile making money off it by using it to run for-profit SaaS.

rodgerd 2052 days ago

Another "I want free shit, I want it to be unlimited, have perfect quality, and I want it yesterday" whine.