by mrweasel 1204 days ago
I'm disappointed that this is an issue for some package management systems. 20 years ago I helped run a mirroring service, and it's still running today. Distributions such as Debian have hundreds of mirrors. This is a solved problem, but we just decided to put everything in the hands of one for-profit company.
4 comments

I can’t speak for homebrew, but cargo/crates.io is fundamentally different from debian. The rate of churn is significantly higher, packages are published constantly and people expect them to be available. You can’t really do that with a system of mirrors like debian does. You can do edge caching and crates does that, but you want some central authority of which package versions are available. And every cargo run queries that index.

It’s acceptable for debian mirrors to lag a few minutes or hours behind. The same thing is much harder to accept when the rate of change is much higher. Different requirements, different tradeoffs.

I guess I don't really understand the need for something like Cargo to be up to date to the second, or even the minute. My assumption is that you build your code with certain package versions in mind and release that to testing. Unless it's a security update, it won't matter if you're 24 hours behind.

Say that version 1.2.1 of a library is released right as you do your build; that won't go into production within that 24 hour window anyway. If it is a security fix, then, like Debian, you pull that from another repository, which is under tighter control.

This thread is glossing over some important details. A package repo has two distinct storage concerns: the index (a list of which versions of packages have ever been released), and the actual packages themselves. It's convenient to have the index centralized, for maximum consistency. But the packages themselves can be stored however you want, and if you try to access a stale mirror without the most recent version of a package, then the client should have the option of using a different mirror or else accepting the old version.

For crates.io specifically, the packages are stored in S3, whereas the index is currently stored as a bog-standard GitHub repo (not as a GitHub Package), and in the near future the crates.io index will also move to crates.io itself (https://blog.rust-lang.org/inside-rust/2023/01/30/cargo-spar...).
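To make the index/package split concrete, here is a minimal Python sketch of a client that trusts one central index for "which versions exist" and then tries package mirrors in order, optionally settling for an older version. Everything in it is hypothetical and for illustration only: the function name, the `mirror/package/package-version.tar.gz` URL layout, and the injected `fetch` callable are assumptions, not how cargo or apt actually work.

```python
from typing import Callable, Dict, List, Optional, Tuple


def fetch_package(
    package: str,
    index: Dict[str, List[str]],
    mirrors: List[str],
    fetch: Callable[[str], Optional[bytes]],
    allow_older: bool = False,
) -> Tuple[str, str, bytes]:
    """Return (version, mirror, data) for the best copy we can find.

    `index` is the central index: package name -> versions, oldest to newest.
    `fetch` returns the file contents, or None if that mirror lacks the file.
    The URL layout below is a made-up convention for this sketch.
    """
    versions = index[package]
    # Newest version only, unless the caller accepts falling back to older ones.
    candidates = list(reversed(versions)) if allow_older else versions[-1:]
    for version in candidates:
        for mirror in mirrors:
            url = f"{mirror.rstrip('/')}/{package}/{package}-{version}.tar.gz"
            data = fetch(url)
            if data is not None:
                return version, mirror, data
    raise LookupError(f"{package}: no mirror has any of {candidates}")
```

Passing `fetch` in as a parameter keeps the sketch free of network code; in practice it would wrap an HTTP GET that returns `None` on a 404.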

Thanks, that wasn't clear to me. Why not just dump the index on the same storage as the packages? If text files are insufficient then do an SQLite database.
There are sound technical reasons to give the index special treatment.

First, the index is very large, and it only ever gets larger over time. I just cloned and compressed the crates.io index (https://github.com/rust-lang/crates.io-index), which resulted in a 58 MB archive (note that I did remember to delete the .git directory).

Second, the index changes very often. Every time anyone ever publishes a new version of a package, that changes the index. For crates.io, this happens hundreds or thousands of times per day.

Third, the index is append-only.

Fourth, the index is extremely frequently requested. Any time the user manually asks for an update, or any time the user adds a new dependency, the local copy of the index needs to be updated.

Putting it all together, since the index is constantly changing and since users will constantly be asking for the latest version, it would be very inefficient to serve the whole thing each time. Instead, a fine-grained solution is more efficient. In the early days of crates.io, this problem was solved by just storing the index in a git repo and letting git take care of fetching new diffs to the index (and the problem of "who pays for hosting" was solved by using GitHub). Now that the crates.io index is outgrowing this solution, it's moving to a more involved protocol where clients will not have local copies of the full index, but instead will only lazily fetch individual index entries as necessary, which is much faster, especially for fresh installs (including every CI run!).
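As a rough illustration of "lazily fetch individual index entries", here is a small Python sketch of how a client can map a crate name to the one index file it needs and read the versions out of it. To my understanding this mirrors the directory layout documented for Cargo registry indexes (short names under `1/`, `2/`, `3/x/`, longer names under two two-character prefixes) and the one-JSON-object-per-line file format; the function names are my own, and no network request is made here.

```python
import json
from typing import List


def sparse_index_path(crate_name: str) -> str:
    """Relative path of a crate's index file, per the layout assumed above."""
    name = crate_name.lower()
    if not name:
        raise ValueError("crate name must not be empty")
    if len(name) == 1:
        return f"1/{name}"
    if len(name) == 2:
        return f"2/{name}"
    if len(name) == 3:
        return f"3/{name[0]}/{name}"
    return f"{name[:2]}/{name[2:4]}/{name}"


def available_versions(index_file_body: str) -> List[str]:
    """Versions listed in one index file, skipping entries marked as yanked.

    Assumes one JSON object per line with "vers" and "yanked" fields.
    """
    entries = [json.loads(line) for line in index_file_body.splitlines() if line.strip()]
    return [e["vers"] for e in entries if not e.get("yanked", False)]
```

A client that wants `serde` would request only the file at `se/rd/serde` under the index's base URL, rather than cloning the whole index.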

I think in the case of Debian, packages are vetted and approved by repository maintainers before being hosted (the repository is curated). I think most application dependency repositories let anyone in and the onus is on the author and user to determine the legitimacy.

I imagine it's easier to get people to mirror curated, signed packages than, effectively, random code

i definitely push stuff to npm and then pull it in as dep on a different project seconds later. mostly because I'm too lazy to eff around with local package resolution which has bitten me before and also implies you're linking against live code instead of a specific snapshot
> cargo/crates.io is fundamentally different from debian

Given the OP, note that packages on crates.io don't (and can't) reference GitHub. Crates.io has its own storage, and the only way to upload a crate to crates.io is if 100% of its dependencies are also on crates.io.

Right. Although crates.io links to GitHub repositories, it doesn't get the code from them. They can be out of sync, which caused me some trouble yesterday.
Indeed, anyone can list whatever URL they want as the "repository" on the crates.io page for any package they publish. There's not much of an alternative, given that crates.io is designed to be immutable, and the internet in general is not. (At best, crates.io could provide a link to a browser-rendered directory tree of the code that crates.io has on hand for any given version.)
If only there were some way to make git distributed!

/s

You mean something like a git annex enabled branch tracking mirror locations of each release artifact like HTTP URLs, (webseeded) torrents, maybe even something content addressed like IPFS? sigh
uhhh wait can you explain what you mean?
Move fast and break things ... /s
just yesterday I was stuck for an hour because a debian package mirror went down. took a long time to talk the user through changing their sources.list so that another mirror was chosen, and the mirror chosen out of that pool was down also. finally I had to manually check for a good mirror and give them the URLs.

the user's take was "why don't they use GitHub packages?"

"still running today" doesn't mean 100.0% uptime.

This is not how this works anymore. The system that is behaving this way must be relatively old at this point, since almost all modern Debian-based distros now use the "mirror://" URI syntax, which automatically falls back to another mirror if one fails.
I don't think a clean Debian stable install uses that today.

But even so, at least the mirrorlist.txt file that appears in the mirror:// URI must be available for it to work, right?

You are correct. While it's supported and part of the APT version in Debian, they don't make much use of it themselves, whereas most downstream distros do (e.g. Ubuntu).

https://manpages.debian.org/bullseye/apt/apt-transport-mirro...

You can still use it in vanilla Debian, but they don't make their mirror list available easily in the correct format, so you would have to basically curl + awk the URLs into a text file and use that.

My guess is that Debian itself probably sees less than 1% of the traffic on their mirrors compared to Ubuntu and they haven't been as motivated to make this change.
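For what it's worth, the "curl + awk the URLs into a text file" step could look roughly like the Python sketch below. It assumes you already have some text that contains mirror URLs (how you obtain it, and which URLs you keep, is up to you) and that the mirror list file wants one URI per line. The function names and the regex are my own illustration, not anything shipped with APT.

```python
import re
from typing import List

# Loose pattern for http(s) URLs; good enough for pulling links out of plain text.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def extract_mirror_urls(text: str) -> List[str]:
    """Return unique URLs found in `text`, in first-seen order, each ending in '/'."""
    seen = set()
    urls = []
    for match in URL_RE.findall(text):
        url = match.rstrip(".,);")  # drop trailing punctuation picked up from prose
        if not url.endswith("/"):
            url += "/"
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def write_mirror_list(urls: List[str], path: str) -> None:
    """Write one URI per line, the simple layout assumed for a mirror list file."""
    with open(path, "w", encoding="utf-8") as f:
        for url in urls:
            f.write(url + "\n")
```

The man page linked above describes how to point a sources.list entry at the resulting file.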