FTA: "Ultimately, package registries need to align their incentives with those of maintainers."
Putting it all on the registries to come up with a viable business model and provide this to maintainers without any responsibility[1] on the part of the maintainer seems really one-sided.
It costs quite a bit of money to run something like Docker Hub or NPM. If you want something aligned with software maintainers first and foremost, you want a non-profit / foundation that's got priorities aligned with the larger community and not a for-profit entity that has to justify keeping the lights on.
Kinda silly headline, too. There are many package registries, but we only see two here that have business models interfering with distribution of software. Only one that's really impeding the ability to host software elsewhere if you don't like their business model.
Docker Hub's rate limits seem unlikely to impact most usage of Docker, and people who're pulling 200 images every six hours should either seek to set up their own registry to take the load off Docker Hub or throw some money to help shoulder the costs. Even if the user's only grabbing Alpine images at 5MB per image, 200 in six hours starts to add up!
[1] Granted maintainers may do a lot of work in actually maintaining the software.
> It costs quite a bit of money to run something like Docker Hub or NPM. If you want something aligned with software maintainers first and foremost, you want a non-profit / foundation that's got priorities aligned with the larger community and not a for-profit entity that has to justify keeping the lights on.
This is an interesting thought. The Linux Foundation has good corporate backing. The FSF traditionally provided the backbone in terms of compilers and userland basics. Maybe their 21st century task (and what keeps them relevant in this age) should be such infrastructure.
The Apache Foundation is also an interesting candidate for this. I think they also had good corporate backing.
"Docker Hub's rate limits seem unlikely to impact most usage of Docker, and people who're pulling 200 images every six hours should either seek to set up their own registry to take the load off Docker Hub or throw some money to help shoulder the costs. Even if the user's only grabbing Alpine images at 5MB per image, 200 in six hours starts to add up!"
Maybe there's something technically wrong with the Docker model?
I was amazed anyone tried to make a free Docker registry. It's like making a CDN, except instead of individual files, it's for a whole app and all of its dependencies. It's a crazy amount of data for storage and bandwidth.
It's more akin to making and then running a CDN but only charging 10% of customers. I'm sure Docker the company was writing it off as a marketing expense, but if they were running this in one of the public clouds that charge for egress, they were paying out a boatload of money just in egress charges (plus more in storage costs)
Well, of course you've got to do that at the start to get people to sign up to the docker model and become dependent on registries in the first place.
Yeah. What if someone had said "Wait, is this sustainable?" before building up a docker-based solution? (I wonder if we can find any archived discussions wondering that, or if we're just so used to thinking "things can scale for free indefinitely on the internet" that nobody wondered?)
Now that they have it, the cost they are willing to pay is based in part on cost-of-switching. Which is pretty enormous, directly and indirectly, when entire ecosystems based on docker have been iterated.
A properly designed registry doesn't have to cost much to run. The costs are incurred when people use the registry for distribution, or for all the infrastructure needed to monetize the registry and track users. All a registry needs to provide is an index of URLs, signing keys, and some useful metadata to enable discovery. Trivial in its purest form, with more cost if you want advanced features like ratings or curation or your own authentication service.
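To make the "index of URLs, signing keys, and some useful metadata" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `IndexEntry` fields, the `Index` class, and the example URL are hypothetical, and a plain SHA-256 digest stands in for a real signature scheme. The point is only that the index itself stores a few short strings per release, while the bytes live wherever the publisher hosts them.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class IndexEntry:
    """One release in a hypothetical index-only registry."""
    name: str
    version: str
    url: str       # where the publisher hosts the artifact (not the registry)
    sha256: str    # digest the client checks after downloading
    metadata: dict = field(default_factory=dict)


class Index:
    """Stores pointers and metadata only; never the artifacts themselves."""

    def __init__(self):
        self._entries = {}

    def publish(self, entry: IndexEntry) -> None:
        self._entries[(entry.name, entry.version)] = entry

    def lookup(self, name: str, version: str) -> IndexEntry:
        return self._entries[(name, version)]

    def search(self, keyword: str) -> list:
        # Discovery: a naive substring match on the description field.
        return [e for e in self._entries.values()
                if keyword in e.metadata.get("description", "")]


def verify(entry: IndexEntry, payload: bytes) -> bool:
    """Check that bytes fetched from entry.url match what the index promised."""
    return hashlib.sha256(payload).hexdigest() == entry.sha256


# Usage: the publisher registers a pointer; the client verifies what it fetched.
artifact = b"pretend this is a tarball"
index = Index()
index.publish(IndexEntry(
    name="example-lib",
    version="1.0.0",
    url="https://downloads.example.org/example-lib-1.0.0.tar.gz",  # hypothetical
    sha256=hashlib.sha256(artifact).hexdigest(),
    metadata={"description": "an example library"},
))
entry = index.lookup("example-lib", "1.0.0")
print(entry.url, verify(entry, artifact))
```

In this shape the bandwidth bill follows whoever hosts the URL, which is the cost split the comment above describes.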
This article ignores the fact that for many languages, the package repository is maintained as free software by volunteers (sometimes with funding from a foundation). This includes Perl, Python, Ruby, Rust, and many others.
NPM is the odd one out, really. I don't think letting one company control a language ecosystem's single package registry is a great idea, for all the reasons that the author notes!
I agree, though it's worth noting that while volunteers can maintain the software and administer the indexes, they also rely on infrastructure provided by big corporations. E.g. the Python Package Index runs on a CDN provided by Fastly, which serves hundreds of TB per day. I very much doubt the non-profit Python Software Foundation could afford that bandwidth if it wasn't an in-kind donation.
It's a couple petabytes, Michael. What could it cost, $10?
Seriously though, Fastly's donation of their CDN service is generous and eases the burden on the PSF, but if push came to shove they could definitely afford the bandwidth. In 2018 they had a net income of half a million.
As of March PyPI was pushing out 300TB a day through its CDN. Ignoring the "off the shelf price" of $0.12/GB and assuming they negotiate a bulk discount driving them down to $0.05/GB, that's still $15,000 a day (or just shy of $5.5 million a year). Their net income in 2018 would cover less than 10% of that bill.
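For anyone who wants to check the arithmetic or plug in a different price, here is a quick back-of-the-envelope sketch. The traffic figure and the two prices are the ones quoted above. The function names are made up, and using decimal units (1 TB = 1000 GB) is an assumption.

```python
# Back-of-the-envelope CDN bill using the figures quoted in this thread.
TB_PER_DAY = 300    # PyPI CDN traffic cited above
GB_PER_TB = 1000    # assumption: decimal units


def daily_cost(price_per_gb: float, tb_per_day: float = TB_PER_DAY) -> float:
    """Dollars per day at a flat per-GB price."""
    return tb_per_day * GB_PER_TB * price_per_gb


def yearly_cost(price_per_gb: float, tb_per_day: float = TB_PER_DAY) -> float:
    """Dollars per year, assuming traffic stays flat for 365 days."""
    return daily_cost(price_per_gb, tb_per_day) * 365


for label, price in [("off the shelf", 0.12), ("assumed bulk discount", 0.05)]:
    print(f"{label}: ${daily_cost(price):,.0f}/day, ${yearly_cost(price):,.0f}/year")
```

Swapping in other per-GB prices only changes the `price` argument. Flat traffic is the other big assumption here, and probably an optimistic one.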
I assume there's some difference in e.g. speed or reliability between Fastly (8 c/GB) and CDN77 (1.6 c/GB)? Even if you can negotiate it to 0.8 c/GB, you're still talking about a huge piece of the PSF's budget. They would have to either cut other expenditure (e.g. making grants) dramatically, or find a lot of new income (continuously, not just a one-off donation drive). And PyPI's bandwidth is growing rapidly [1].
If PyPI didn't have a sponsor providing bandwidth, I'd guess it would implement some form of rate limiting and encourage people to mirror/cache packages much more to reduce load. I don't think it would die completely, but it would be less convenient and still cost the PSF a fair bit of money.
The PSF's tax returns are published on python.org, and 'revenue less expenses' for 2018 was just under $280k. IANA accountant, so maybe that's the wrong line to look at.
Hi, I'm a PSF Director and the PSF's Treasurer as of this year. For transparency, our tax returns as of 2018 (and soon 2019) are up on https://www.python.org/psf/records/
You are right that this donation does not show up in our tax filings because they provide it to any OSS project.
Yes, that's a good point. The Perl repo and services are also relying on donations from various companies. That said, it's easier to switch donors than it is to switch repositories.
I just used my own docker repo right from the start, precisely to make sure it'll be on a domain that I control.
In my opinion, every open source project using a foreign-owned domain as their main distribution method is just naive. Of course you'll never be able to keep things constant, because you never had any power over that domain.
This is an important conversation and I can't help but think that it is repeatedly drawn in predictably valley-minded directions.
As a responsible developer your use case is likely: "I want to install package X, and know that my customers are receiving and installing that package when they perform a build"
The ability of individuals to self-host content has been held back by the limitations of DNS (content addressing) and IPv4 (routability) for a long time now.
What I ask is this: if you as a developer were able to self-host the libraries and applications you offer to others -- regardless of whether they're open source or proprietary -- would that not solve most of these perverse incentive situations?
- The bandwidth costs would be yours, but that would allow you to find a charging model that works
- If your software became extremely successful - beyond your own ability to pay for the bandwidth - then the companies and individuals who rely upon your software would be incentivized to step in to foot the bill
- Data about software adoption and usage would be shared, as a matter of course, with the providers who offer the bandwidth for it
- We would not be reliant (in a community sense, but also in a day-to-day operations sense) on the benevolent albeit often-loss-leading hosting of centralized repositories
How can we not self-host? For lots of these you can if you want. I mean, I host docker images and various packages that folks source from our infrastructure. I don't see DNS or IPv4 as a blocker
A 13 year old dabbler cannot self host using their home connection. And that is arguably a problem.
Hell, I don't know how I would self-host, I just (reluctantly) put stuff on paid-for servers. I guess you start by calling your ISP and asking if they can pretty please give you a static IPv4?
You got it, yep; I should've been more precise about what I meant by self-hosting.
Smartphones and data plans would provide the bare necessities for self-hosting; in many cases there's abundant computing and bandwidth resource available to them. Many packages/containers are small and downloaded infrequently, and much data capacity, I expect, goes unused.
If and when smartphone-hosted resources become insufficient, a cloud VPS like your approach would be a next logical upgrade. Beyond that, corporate/foundation-based sponsorship and dedicated servers and bandwidth.
The key would be to make it near-seamless to migrate between those different environments. Namespaced source code repositories and packages appear to have worked well for the likes of GitHub, GitLab, NPM and Docker Hub, so perhaps following similar conceptual design ideas would make sense.
A few areas of concern would be:
- How do you keep end-user devices safe if they will be hosting content for a wide audience?
- How do you react to credentials and other protected content being posted if repositories themselves are a distributed network?
- How do you achieve discoverability and search of content in a distributed environment?
... not to mention whether the effort and migration to such a model is worth the benefits.
I tend to think it would be, since it aligns the incentives around spending, increases hardware utilization, and increases resilience by removing single points of failure.
That said, I also imagine there are well-founded and sincere arguments for continued centralized code and container hosting that are valid and worthwhile (not least of which: it's where we are, and it's relatively straightforward to reason about).
The problem with Docker Hub is that they don't let you choose the pricing model. In the end, bandwidth and CPUs are not free, so someone has to pay. For a while it was VCs hoping for growth, but we all know that giving stuff away for free is not sustainable forever.
The problem with Docker Hub's pricing model is there are actually two use cases for Docker Hub. One is where some random entity makes some software -- they don't have any money, so it makes sense for the user to pay to download it. It's cheaper than setting up your own CI and hosting to build images, after all. That's the only pricing model that Docker Hub supports right now. It may be annoying that it used to be free, but that was just an accident -- you should have been paying for Docker Hub pulls from day one.
The other model is where some commercial entity wants to distribute software to their users. In that case, they'd be happy to pay for their users to anonymously download it. But, Docker Hub doesn't support this particular model, and that's what's causing a lot of problems. (Where I work, this is already a problem for our customers. I think we're going to move to GCR, and like the article mentions, this is a pain because we are locked in -- you can't make Docker Hub 302 to your new host. Everyone has to rewrite their configuration to pick up the new registry, all because Docker won't let us pay for their pulls!)
The article also talks about NPM being problematic. I kind of agree with this; I spend a lot of time waiting for my CI system to re-pull node modules. (Because everything is done in containers, nothing is preserved between builds. Things can be preserved if you want them to be, but on CircleCI it's actually faster to re-pull all the modules from the Internet than to restore from their own caching mechanism. Shrug.) I had this problem with Go when modules first came out (our builds were getting rate limited by GitHub because a lot of our dependencies happened to be hosted there), but it was easy enough to set up a local module proxy, and it was never a problem for us again. (Go has since decided to run their own module proxy, and I stopped running my own. Theirs is great! Obviously, someday it won't be free anymore, but at least running your own is a first-class option.) NPM, it seems, has never really offered this as an option -- they've never rate limited me, and their documentation says "don't worry about it!" which of course makes me worry more ;)
>> The other model is where some commercial entity wants to distribute software to their users. In that case, they'd be happy to pay for their users to anonymously download it.
This has hit us big-time last week -- all our k8s components are hosted on DockerHub and our firm is happy to do publisher-paid pulls, but it seems to not be an option. Instead, every customer is expected to get a paid account (if they pass some download threshold, which they may or may not.)
What are others doing? We're thinking of moving to Azure Container Registry or Amazon Elastic Container Registry. We're happy to pay just to ensure customers are not randomly throttled, though I wish we could just keep it all in DockerHub (we'd be fine paying DockerHub for customer usage, but we can't force customers to do so themselves.)
This is a common pattern for breaking the pricing mechanism and preventing competition: separate those who choose the service from those who pay for it. In this case free hosting is offered to publishers, who choose to host their images there, and then the users who download them are asked to pay. Had publishers been asked to pay, they would have selected an appropriate registry, or perhaps self-hosted one rather than use a third party's.
Most of my dependencies I wrote myself; I maintain them on GitHub and include them as Swift Package Manager dependencies. I write in a modular, layered fashion, and try to fork off as many components as possible into standalone projects.
I'm very, very careful about including third-party dependencies. I think that these are the only ones that I use, throughout my projects:
SOAPEngine (Paid)
ffmpeg
VLCLib
SwiftKeychain
The first, I downloaded and installed directly into my repo (no live link), the two video libs, I use Carthage to include from their home repos, and the last, SPM (also from the home repo). No real registries. I am not a fan of CocoaPods. I use Homebrew for some dev utils on my computer, but the above list is what ships.
I may have one more, somewhere, but I can't remember, and I'm too lazy to look. We can rest assured that it was not lightly added.
We have IPFS, and the code for hosting most registries is open source. If the open-source community really wanted to / got annoyed enough, it would devise a system that used those components to make a distributed package registry.
It's easy to complain, it's more difficult to work on solutions. We should all be doing more of the latter (working on solutions).
I disagree. A precise bug report is good, a "precise complaint" is a mere opinion. It feels really nice to write one and people like patting themselves on their back, but opinions are like assholes, everybody has one.
Personally, I'd much rather see a solution to a problem than a complaint. The solution might involve discussions that go back and forth, but if they culminate in a decision on a way forward with a person willing to do the work, they are much more useful than "your code sucks on line".
Complaining isn't contributing.
Edit: also, working on a solution doesn't preclude discussion on the proper solution.
> Relying on generosity when there are infrastructure costs does not seem to be a workable model. It's somewhat amazing that it has survived so long.
I too was initially surprised, but then I thought of GitHub (this was also the case pre-Microsoft acquisition), SourceForge, GitLab, etc., who all seem fine with me downloading/cloning as many repos as I want without charge.
There are tools for doing this, but it's a matter of cost and complexity to deal with them.
Artifactory seems to have a pretty big chunk of this vertical. It supports a few different repository protocols, so it serves as a bit of a one-stop shop that survives technology changes.
Way more than “kinda”. If you have a continuous integration pipeline that checks out projects from scratch (as it should), every build fetches all dependencies, transitively.
Even ignoring download costs, a local cache (one of the functions of an artifactory) helps speed up those downloads and, with it, your builds. It probably also helps against getting blacklisted by code repositories.
An artifactory also automatically backs up any libraries you use. That protects against them disappearing from the internet.
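A toy sketch of the pull-through idea, to show why both points hold. It is not how Artifactory (or any real product) is implemented: the `PullThroughCache` class and the dict standing in for the upstream registry are hypothetical, and real tools add persistence, eviction, and checksum verification on top.

```python
class PullThroughCache:
    """Fetch from upstream once, then serve every later request locally."""

    def __init__(self, upstream):
        self._upstream = upstream   # callable: artifact name -> bytes
        self._store = {}            # local copies, keyed by artifact name
        self.upstream_hits = 0      # how often we actually went to the network

    def get(self, name: str) -> bytes:
        if name not in self._store:
            # Cache miss: one trip upstream, then keep the bytes locally.
            self._store[name] = self._upstream(name)
            self.upstream_hits += 1
        return self._store[name]


# Simulated upstream registry (hypothetical contents).
remote = {"left-pad-1.3.0.tgz": b"module bytes"}
cache = PullThroughCache(lambda name: remote[name])

# A from-scratch CI build asks for the same dependency over and over.
for _ in range(50):
    cache.get("left-pad-1.3.0.tgz")
print("upstream requests:", cache.upstream_hits)

# The package disappears from the internet; the local copy still serves builds.
del remote["left-pad-1.3.0.tgz"]
print(cache.get("left-pad-1.3.0.tgz") == b"module bytes")
```

Fifty builds cost one upstream request, and the last lookup succeeds even though the upstream copy is gone.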
I think the first wave of artifactory customers was also populated by companies with limited network connectivity. It’s nuts to run a Rails or J2EE project if your company is using a pair of 1MB modems for all traffic, even if the dependencies are relatively small. Branch offices are similarly hamstrung. That was part of Perforce’s customer base as well, since they could run a local proxy for source code.
As you get into CI/CD you start to notice that your upstream repo is occasionally down, because it’s getting in the way of some deadline.
We use Nexus and cache all of our packages, but it is one more system to maintain and update. Sure, Nexus is a great asset and has almost never given us trouble.
AWS has a managed service CodeArtifact that supports all the common code package repos and allows caching of upstream repos. Granted it doesn't work with Docker Images, but you asked about packages.
Someone would have to support it 24x7 and we could never get the uptime of DockerHub/ACS/ECS. Since a Production k8s deployment could spin up an instance at any time of day, some type of 5-9 or at least 4-9 uptime is pretty important.
1) set up a series of N docker registry mirrors in pull-through mode (https://docs.docker.com/registry/recipes/mirror/, it's as simple as "docker run --rm --name registry -d -p 5000:5000 -e REGISTRY_STORAGE_DELETE_ENABLED=true -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io -v /mnt/persistentdata/registry:/var/lib/registry registry")
2) expose them on the same domain name (multiple A records, loadbalancers, whatever you want)
3) set them as mirrors in each machine's docker daemon
If one of your mirrors goes down, take it out of the DNS/LB rotation. That's it.
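For step 3, the daemon side comes down to a `registry-mirrors` entry in the Docker daemon's configuration file (typically `/etc/docker/daemon.json` on Linux). A small sketch that generates that fragment; the mirror hostname is a placeholder for whatever name you exposed in step 2, and the helper function is made up for illustration:

```python
import json


def daemon_config(mirror_urls):
    """Render a daemon.json fragment that points the Docker daemon at mirrors."""
    return json.dumps({"registry-mirrors": list(mirror_urls)}, indent=2)


# Hypothetical shared name for the N mirrors behind DNS or a load balancer.
MIRROR_URL = "https://docker-mirror.example.internal:5000"

print(daemon_config([MIRROR_URL]))
```

If the file already has other settings, merge the key in rather than overwriting the file, then restart the daemon so it picks up the change.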
Nothing. We did this when NPM was having issues and it worked very well for us, we also did this for some non-US team mates who had very poor NPM performance.
It runs well, is easy to keep up and working and generally was awesome.
Maybe a content-addressable P2P web where everyone has their own cache would be better as a way of distributing packages. It would spare a lot of bandwidth costs for the hosts and maybe make us less dependent on big corporate benefactors.
Definitely one use case for IPFS I'm personally super excited about.
One objection that I'm sure will come up is "what if people stop hosting a package you depend on":
That's where dedicated package hosting services (like npm) can come in and provide a reliable source of these packages that's fast and always available (potentially for a price). The benefit over the status quo is those services will be commoditized, so they have to compete purely on price/reliability, since as a user you don't have to care where the packages come from in a content-addressed system.
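Here is a rough sketch of why the source stops mattering in a content-addressed system. It uses a plain SHA-256 hex digest as the address, which is an assumption for illustration (IPFS has its own identifier format), and the `Host` class and `fetch` function are hypothetical. Because the address is derived from the bytes, the client can take the package from any host, paid or volunteer, and reject anything that doesn't hash to what it asked for.

```python
import hashlib


def address(payload: bytes) -> str:
    """Content address: derived from the bytes, not from who hosts them."""
    return "sha256:" + hashlib.sha256(payload).hexdigest()


class Host:
    """Any party willing to serve blobs: a peer, a mirror, a paid service."""

    def __init__(self, name: str):
        self.name = name
        self._blobs = {}

    def put(self, payload: bytes) -> str:
        addr = address(payload)
        self._blobs[addr] = payload
        return addr

    def get(self, addr: str):
        return self._blobs.get(addr)


def fetch(addr: str, hosts) -> bytes:
    """Try hosts in order; accept the first copy whose hash matches addr."""
    for host in hosts:
        payload = host.get(addr)
        if payload is not None and address(payload) == addr:
            return payload
    raise LookupError(f"no host could serve {addr}")


package = b"pretend this is a package tarball"
volunteer = Host("volunteer-peer")      # has stopped hosting the package
paid = Host("paid-hosting-service")     # commodity fallback
addr = paid.put(package)

print(fetch(addr, [volunteer, paid]) == package)
```

The client code never names a particular registry; the list of hosts is interchangeable, which is what would let hosting services compete on price and reliability alone.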
I don’t know the specifics about Docker, but at least for NPM the article got some of the things wrong:
> ..npm, and other comparable registries are incentivized to create lock-in
The author’s conclusion doesn’t match the actions of NPM. There are open source implementations of both client and server, you can run your own NPM registry, and you can change the default registry easily.
> What makes npm’s particular scenario even worse is that they've made it so difficult to use a registry that is not npm
Uh?! I’m using Verdaccio (an open source npm registry) every day, the setup is extremely easy. Also yarn uses their own registry by default.
What about self hosting? Anything but the most popular projects should be possible to host on a Raspberry Pi with static web pages and torrents for large downloads. Off your own home network...
It would be wonderful to keep the service free, but it's also a fair point that the data usage must be tremendous.
Perhaps a balance could be achieved by rate limiting by IP to encourage caching by devs. I'm guessing a fair amount of waste occurs from CI testing. Normally once an end user has the image that should be enough.
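A minimal sketch of what per-IP limiting could look like, using a sliding window. This is not Docker Hub's actual implementation: the `PullLimiter` class is hypothetical, and the default of 200 pulls per six hours is just the figure mentioned elsewhere in this thread.

```python
from collections import defaultdict, deque


class PullLimiter:
    """Sliding-window limiter: at most max_pulls per IP per window."""

    def __init__(self, max_pulls: int = 200, window_seconds: float = 6 * 60 * 60):
        self.max_pulls = max_pulls
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)   # ip -> timestamps of allowed pulls

    def allow(self, ip: str, now: float) -> bool:
        pulls = self._history[ip]
        # Forget pulls that have aged out of the window.
        while pulls and now - pulls[0] >= self.window_seconds:
            pulls.popleft()
        if len(pulls) >= self.max_pulls:
            return False                     # over the limit: the client should cache
        pulls.append(now)
        return True


# Usage with small numbers so the behaviour is easy to follow.
limiter = PullLimiter(max_pulls=3, window_seconds=10)
results = [limiter.allow("203.0.113.7", now=t) for t in (0, 1, 2, 3)]
print(results)
```

A CI fleet behind one NAT address would share a single budget under this scheme, which is exactly the pressure toward a local cache or mirror that the comment is after.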
It works well with youtube. Why not do the same for software? Advertisement is obviously not an option but one could easily ask companies to pay a few k$ a year for professional access. Then just take 30% or so as the platform and redistribute the rest among the uploaders.
More to the point, an even better business model is to make such a place, manipulate people into it, then pretend you're just charging fair market price for air.