by nyoomboom 1519 days ago
How does cargo prevent reproducible builds?

The default behavior of cargo is to download stuff from the internet. This may be the least reproducible thing ever.

I'm honestly astonished that programmers of a language that is deemed to be "safe by default" thought that this behavior was acceptable in any form, let alone as the default. If downloading things at build time is somehow necessary, it should be an obscure option behind a flag with a scary name, like --extremely-unsafe-i-know-what-i-am-doing, that prompts the user with a small Turing test every time it is run. Cargo is just bonkers; it doesn't matter at all if it is "convenient" or not. Putting convenience before basic safety and reproducibility is contrary to the spirit of the language itself.

It's as if bounds checking in the language was deferred to a third party that you need to "trust" in order to believe that you won't have segmentation faults.

It doesn't just download random things. Cargo generates a Cargo.lock file with checksums and will make sure that those checksums match when building later on. It's about as safe as vendoring all dependencies while being far easier to work with (though tools like cargo-vendor do exist, of course).

Edit: for things like the kernel, vendoring dependencies is still probably not a bad idea, of course
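
To make the checksum point concrete, here is a minimal Python sketch of the idea, not Cargo's actual code. It assumes the lock file stores a SHA-256 hex digest of each downloaded crate archive; the archive bytes and the helper names `sha256_hex` and `verify_download` are made up for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of some bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_download(data: bytes, locked_checksum: str) -> bool:
    """Return True only if the downloaded bytes hash to the locked checksum."""
    return sha256_hex(data) == locked_checksum


# Hypothetical stand-in for the contents of a downloaded crate archive.
original = b"pretend this is a crate archive"

# The checksum that would have been written to the lock file at lock time.
locked = sha256_hex(original)

# The same archive after someone has tampered with it on the server.
tampered = original + b" plus something malicious"

print(verify_download(original, locked))
print(verify_download(tampered, locked))
```

The first check passes and the second fails, which is the whole argument: a later download is only accepted if it is byte-for-byte what was locked.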

What prevents a given URL from disappearing? Does that just break a particular source version of the Linux kernel?

What happens when a given dependency adds new kernel-inappropriate features? Are kernel devs going to act like distro maintainers and decide between forking, maintaining patch sets, etc.?

All crate sources are stored in the crates.io package archive, which never deletes packages.

A dependency veering off in a direction you don't like is one of the risks of using someone else's code instead of writing it yourself. Cargo makes it easy to use forked dependencies, and forking a dependency is almost always less work than if you'd never used it and written the code yourself from the beginning. (And to be clear this is only a problem for future evolution; a crate author cannot remove or modify an already-published version of their crate.)

This is still fairly short-sighted. Websites shut down, large websites with big storage demands are especially vulnerable to attrition. Who wants to pay the mounting bill for keeping decades of revisions of historical Rust packages online?

I can grab the kernel sources from 1997 and build them today. Will I be able to build Rust code from 2022 in 2047? Because the 1997 kernel will still build at that date.

"I can grab the kernel sources from 1997 and build them today."

Where would you be grabbing it from? ...From a website? "Websites shut down, large websites with big storage demands are especially vulnerable to attrition. Who wants to pay the mounting bill for keeping decades of revisions of historical Linux kernels online?"

> I can grab the kernel sources from 1997 and build them today.

Can you? Do they still compile with a current compiler? You'll probably need to find a compiler of that time... And also all the interpreters for all the build scripts. Was that using bash or some old Perl? Maybe something more esoteric like m4 or tcl?

The point is that it always had many external dependencies to bootstrap. Adding one more is not such a big deal; it just adds another thing to archive among the many other things. The crates.io archive is probably not even that big.

I'd rather comment than downvote. Who cares about a kernel build from 1997 (25 years ago)? What was the hardware back then, Pentium 2? Sorry for the snark in advance but: Why make mountains out of molehills? Life is hard enough as it is.

I would imagine the kernel would use cargo vendor, or similar, to lock all dependencies into their chosen source control and quality requirements.

https://doc.rust-lang.org/cargo/commands/cargo-vendor.html

"Never" is a long time, just saying. It'll be impossible to beat the "availability" guarantees of a local mirror (like a thumb drive) of a kernel source tarball.

What happens when a crate version has to be removed due to a critical CVE or court order (IP Law violation, perhaps)? There may come a day where crates.io becomes torn between not breaking Linux source and not hosting actively bad source code.

Note that some of those concerns do apply to vendoring source as well, but the additional download step also removes options that the kernel maintainers have as long as they ship all the source for the kernel in one tarball, such as more control over the timing of inevitable decisions.

> What happens when a crate version has to be removed due to a critical CVE or court order (IP Law violation, perhaps)?

CVE = The Yank flag. Cargo will refuse to add new yanked packages to a lock file, but if a yanked package is already in the lock file, it will still build. The package is not actually deleted. https://doc.rust-lang.org/cargo/commands/cargo-yank.html

Legal = Hard delete. Nobody will go to jail just to avoid breaking your build. Of course, since crates.io and kernel.org are in the same legal jurisdiction, is there any actual difference here?
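
The yank rule described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Cargo's resolver: the `Version` type and `pick_version` function are hypothetical, and the registry is just a list assumed to be ordered oldest to newest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    number: str
    yanked: bool = False


def pick_version(available, locked=None):
    """Choose a version following the yank rule from the comment above.

    `available` is assumed to be ordered oldest to newest.
    """
    by_number = {v.number: v for v in available}
    # A version already recorded in the lock file keeps working,
    # even if it has been yanked since.
    if locked is not None and locked in by_number:
        return locked
    # A fresh resolution never selects a yanked version.
    candidates = [v for v in available if not v.yanked]
    if not candidates:
        raise LookupError("no non-yanked version available")
    return candidates[-1].number


# Hypothetical registry contents: the newest release has been yanked.
registry = [Version("1.0.0"), Version("1.0.1", yanked=True)]

print(pick_version(registry))
print(pick_version(registry, locked="1.0.1"))
```

With no lock file the sketch falls back to 1.0.0, while a project that already locked 1.0.1 keeps building with it. A hard delete for legal reasons is the case this model does not cover, since the version would be gone from `available` entirely.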

What happens today when a kernel module has to be removed due to a critical CVE or court order?

That's not just a rhetorical flourish, I'm actually curious what the answer is. As far as I know, (1) it almost never happens and (2) when it does, the change is made in upstream repos and as a practical matter, everyone downloads those changes and their up-to-date local copies lose that code.

Does crates.io actively host any code? I thought it was all just READMEs and links to GitHub and docs.rs

To the first question, obviously the sources of dependencies would be brought into the tree. This is easy and there's no reason I'm aware of not to do it for something like the Linux kernel.

To the second set of questions, how is this any different than any other dependency the kernel has? If the answer is "the kernel has no dependencies" then yeah, I'm very sympathetic to the argument that bringing in Rust libraries is not a good reason to start having dependencies when none previously existed at all, but is that the case?

You're forgetting about custom build scripts. Thankfully most of the core ones have moved off cloning dependencies for FFI purposes (think cloning an alsa-lib version for FFI), but it used to be super common.

The lock file is created but is not used by default.

You must specify --locked to get that behaviour.

No, it is. Even without `--locked`, the Cargo.lock file is only updated when it no longer fulfills the Cargo.toml because the latter was edited (and then only making the minimal changes necessary), or explicitly using `cargo update`.

I don’t follow - I’m saying the Cargo.lock isn’t read unless you specify --locked - I’m not talking about when it gets refreshed?
Yes, it's always read. If the file didn't require updating, a build with and without `--locked` will be identical. If it did require updating, `--locked` will make cargo exit with an error.
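
A toy Python model of that behaviour, to show why the two runs only differ when the lock file is stale. Everything here is an assumption made for illustration: the manifest maps each dependency to a list of acceptable versions (oldest to newest) as a stand-in for a real version requirement, and `resolve` and `LockOutOfDate` are hypothetical names, not Cargo internals.

```python
class LockOutOfDate(Exception):
    """Raised when locked=True and the lock file would need to change."""


def resolve(manifest, lock, locked=False):
    """Return the dependency versions a build would use.

    manifest: name -> list of acceptable versions, oldest to newest.
    lock:     name -> exact version recorded earlier.
    """
    # The lock file is always read: find entries that no longer satisfy
    # the manifest (missing, or pinned to a version no longer allowed).
    stale = [name for name, allowed in manifest.items()
             if lock.get(name) not in allowed]
    if not stale:
        # Nothing to update, so locked=True and locked=False agree.
        return dict(lock)
    if locked:
        raise LockOutOfDate("lock file needs updating: " + ", ".join(stale))
    # Minimal update: only the stale entries are re-resolved.
    updated = dict(lock)
    for name in stale:
        updated[name] = manifest[name][-1]
    return updated


# Made-up dependency data for the example.
manifest = {"serde": ["1.0.0", "1.0.1"], "rand": ["0.8.5"]}
lock = {"serde": "1.0.0", "rand": "0.8.5"}

# Same manifest after someone edits the requirement on rand.
edited = {"serde": ["1.0.0", "1.0.1"], "rand": ["0.9.0"]}

print(resolve(manifest, lock) == resolve(manifest, lock, locked=True))
print(resolve(edited, lock))
```

Note that serde stays at the locked 1.0.0 even though 1.0.1 is allowed; only the entry that stopped satisfying the manifest moves. Calling `resolve(edited, lock, locked=True)` raises instead of updating.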
That's true when running `cargo install` to install an application directly from crates.io, but not when running `cargo build` in an already checked-out repository.
I might be misreading this on an iPhone screen, but as I follow the logic here: https://github.com/rust-lang/cargo/blob/a77ed9ba87bfeaf3c275...

A cargo build ends up there calling into the resolver’s resolve_ws_with_opts() which would refresh the lockfile.

Not resolve_with_previous() which would use the lock file as-is.

The only reason this sticks in my mind is that I ran into an issue building bat after I made some changes. I obviously assumed it was my changes, so I went through the process of debugging and backing out my changes until finally I was back to a virgin branch and it was still failing. Passing --frozen --locked fixed it.

> It doesn't just download random things.

That's exactly what it does. The developer is not really expected to thoroughly review the codebase of every dependency.

Just like JavaScript, all sorts of supply chain attacks are made possible.

A single malicious library can sneak into large ecosystems easily.

If your project has a Cargo.lock file checked into its repo, then everyone checking that out will download the same code for all dependencies (unless someone manages to compromise the crates.io package archive). That is very far from "the least reproducible thing ever".
Cargo.lock also contains crate hashes. So, if someone compromises crates.io and tampers with a crate, you would notice.
> The default behavior of cargo is to download stuff from the internet. This may be the least reproducible thing ever.

Wait till you find out about Java ecosystems

I know investment bank dev teams pulling whatever they need from maven central with no oversight or introspection.

Log4j joins the conversation.
Yeah, I'm glad I wasn't around when that was discovered.
> The default behavior of cargo is to download stuff from the internet.

This is borderline inevitable for most modern development stacks, though lock files can definitely help. They can even record hashes to check against, if you care about your dependencies being the same as when you first added them to the project and/or inspected the code.

As for worries about the things in those URLs disappearing, in most cases you should be using a proxy repository of some sort, which I've seen leveraged often in enterprise environments: something like JFrog Artifactory or Sonatype Nexus, with repositories set up either globally or on a per-project basis.

The problem here is that all of these repository managers kind of suck, and so does the ecosystem around them:

  - for example, Nexus routinely fails to remove all of the proxied container images and their blobs that are older than a certain date, bloating disk space usage
  - when proxying npm, Nexus needs additional reverse proxy configuration, since URL encoded slashes aren't typically allowed
  - many popular formats, like Composer (or plenty more niche ones) are only community supported https://help.sonatype.com/repomanager3/nexus-repository-administration/formats (nobody will ever cover *all* of the formats you need, unless you limit yourself to very popular stacks)
  - many of the tech stacks that have .lock files may also include URLs to the registry/repository from which they're acquired, so some patching might be necessary
  - in technologies like Ruby, actually setting up the proxy isn't as easy as it is in npm; you can't just run something like "bundle install --registry=..."
  - in other technologies, like Java, you get into the whole SNAPSHOT vs RELEASE issue, and even setting up publishing your own packages to something like Nexus can be a bit of work; the lack of proper code libraries for reuse and the abundance of copy-pasted code that I've seen are proof of this in my mind

Of course, I'm mentioning various tech stacks here and I don't doubt that in the long term Rust and other technologies might also address their own individual shortcomings, but my point is that dependency management is just a hard problem in general.

So, for most people the approach that they'll take is to just install stuff from the Internet that other people trust and just hope that the toolchain works as expected, a black box of sorts. I've seen plenty of people just adding packages without auditing 100% of the source code which seems like the inevitable reality when you're just trying to build some software with time/resource constraints.

I'd really like to know where you think C++ dependencies and headers come from.
Downloading C++ dependencies during the build process is equally unacceptable in many situations. Existing C++ build systems and package managers can be configured to do that, and those build systems and package managers would be inappropriate for supporting a kernel that values stability and long-term support.
So it's a good thing that cargo can be used without downloading dependencies during the build! Just clone the repos of the dependencies (and transitive dependencies), just like you would for a C++ project. Then set up your cargo file to point at the location for your local copy instead of using the default download behavior.

There's even a tool called cargo-vendor that does this for you!

I was just saying anything folks might find objectionable about cargo workflows happens in C and C++ workflows, just in ad hoc ways.

The difference is that no C or C++ package management features are proposed for incorporation in the Linux kernel SDLC.