by eemax 1696 days ago
We pin all of our npm dependencies and upgrade them via dependabot. Dependabot links to the GitHub or GitLab release for each dependency bump, and I typically skim / scan every single commit to each dependency. But there's no guarantee that what's on GH matches what is uploaded to npm (which is what happened in this case; there are no malicious commits).
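
For what it's worth, the "pin everything" part is easy to check mechanically. Below is a minimal sketch, assuming "pinned" means a plain exact version string with no range operator; the manifest contents and the helper name `unpinned_dependencies` are made up for illustration, and git or URL specifiers would be reported as unpinned by this simple check.

```python
import json
import re

# Matches a plain exact version such as "1.2.3" or "1.2.3-beta.1".
# Anything else (^, ~, ranges, tags, git URLs) is treated as not pinned.
EXACT_VERSION = re.compile(r"^\d+\.\d+\.\d+(?:[-+][0-9A-Za-z.+-]+)?$")


def unpinned_dependencies(package_json_text: str) -> dict[str, str]:
    """Return the dependencies whose version spec is not an exact version."""
    manifest = json.loads(package_json_text)
    loose = {}
    for section in ("dependencies", "devDependencies"):
        for name, spec in manifest.get(section, {}).items():
            if not EXACT_VERSION.match(spec):
                loose[name] = spec
    return loose


# Hypothetical manifest, for illustration only.
example = json.dumps({
    "dependencies": {"left-pad": "1.3.0", "lodash": "^4.17.21"},
    "devDependencies": {"typescript": "~5.4.0"},
})
print(unpinned_dependencies(example))
```

This only looks at direct dependencies; transitive ones are pinned by the lockfile, not by `package.json`.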

Does anyone know of a good way to verify that an npm release matches what's on GH? Version controlling the entirety of node_modules/ and running untrusted updates in a sandbox would work in theory, but in practice many packages contain minified JS, which makes the diffs between version bumps unreadable.

5 comments

Skip the nonsense and just check your dependencies in directly to your repo. The separation has no real world gains for developers and doesn't serve anyone except the host of your source repo. As it turns out most people's repo host is also the operator of the package registry they're using, so there aren't even theoretical gains for them, either.

Doing it this way doesn't preclude the ability to upgrade your dependencies, it _completely_ sidesteps the intentional or unintentional desync between a dependency's source and its releases, it means people have to go out of their way to get a deployment that isn't reproducible, and in 4 years when your project has rotted and someone tries to stand it up again even if just temporarily to effect some long-term migration, then they aren't going to run into problems because the packages and package manager changed out from beneath them. I run into this crap all the time to the point that people who claim it isn't a problem I know have to be lying.

> I run into this crap all the time to the point that people who claim it isn't a problem I know have to be lying.

I don't think that's right.

Just because someone denies a problem exists—a problem that you know for a fact, with 100% certainty exists—doesn't mean they're lying.

It may mean you know they are wrong, but wrong != lying, and it's a good thing to keep in mind.

If you have external reasons to believe that the person you're talking to should or does know better, then it's fair to say they are lying.

But, in general, if you accuse someone who is simply wrong to be lying, you're going to immediately shut down any productive conversation that you could otherwise have.

People don't do this because `node_modules` can be absolutely massive (hundreds of megabytes or more), and a lot of people don't like (for various reasons) such large repositories.

There is a deprecated project at my work that committed the entire yarn offline cache to the repo. At least those were gzipped, but the repo still had a copy of every version of every dependency.

It isn't a good long term solution unless you really don't care at all about disk space or bandwidth (which you may or may not).

A middle ground that I've seen deployed is corporate node mirrors with whitelisted modules. Then individual repos can just point to the corporate repo. Same thing for jars, python packages, etc.
And build pipelines that fail due to the size of the repo.
Committing node_modules and reproducibility are somewhat orthogonal though.

You can get reasonable degrees of reproducibility by choosing reasonable tools: Yarn lets you commit their binary and run that in the specified repo regardless of which version you have installed globally. Rush also allows you to enforce package manager versions. Bazel/rules_nodejs goes a step further and lets you pin node version per repo in addition to the package manager. Bazel+Bazelisk for version management of Bazel itself provides a very hermetic setup.

Packages themselves are immutable as long as you don't blow away your lockfile. I used to occasionally run into very nasty non-reproducibility issues with ancient packages using npm shrinkwrap (or worse, nothing at all), but since npm/yarn got lockfiles, these problems largely went away.
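
To make the immutability point concrete: as I understand it, npm lockfiles record an `integrity` string for each package in Subresource Integrity form (`sha512-` followed by a base64 digest), and the downloaded tarball is checked against it. A minimal sketch of that check follows; the tarball bytes are fake, the function name `matches_integrity` is mine, and real integrity fields may list several hashes, which this ignores.

```python
import base64
import hashlib


def matches_integrity(tarball: bytes, integrity: str) -> bool:
    """Check bytes against a single SRI string of the form '<algo>-<base64 digest>'."""
    algorithm, _, expected = integrity.partition("-")
    if algorithm not in ("sha512", "sha384", "sha256", "sha1"):
        raise ValueError(f"unsupported algorithm: {algorithm}")
    digest = hashlib.new(algorithm, tarball).digest()
    return base64.b64encode(digest).decode("ascii") == expected


# Fake tarball contents and the integrity string a lockfile would record for them.
fake_tarball = b"pretend these are the bytes of a package tarball"
recorded = "sha512-" + base64.b64encode(hashlib.sha512(fake_tarball).digest()).decode("ascii")

print(matches_integrity(fake_tarball, recorded))         # same bytes
print(matches_integrity(fake_tarball + b"!", recorded))  # tampered bytes
```

Note this only proves you got the same tarball as last time. It says nothing about whether that tarball matches the source repo.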

These days, the non-hermeticity stuff that really grinds my gears is the very low plumbing stuff. On Mac, Node-GYP uses xcode tooling to compile C++ modules, so stuff breaks with MacOS upgrades. I'm hoping someone can come up with some zig-based sanity here.

As for committing node_modules, there are pros and cons. Google famously does this at scale and my understanding is that they had to invest in custom tooling because upgrades and auditing were a nightmare otherwise. We briefly considered it at some point at work too but the version control noise was too much. At work, we've looked into committing tarballs (we're using yarn 3 now) but that also poses some challenges (our setup isn't quite able to deal w/ a large number of large blobs, and there are legal/auditing vs git performance trade-off concerns surrounding deletion of files from version control history)

Ill-advised tool adoption is exactly the problem I'm aiming to get people to wake up and say "no" to. You need only one version control system, not one reliable one plus one flaky one. Use the reliable one, and stop with the buzzword bandwagon, which is going to be a completely different landscape in 4 years.

> Packages themselves are immutable as long as you don't blow away your lockfile

Lockfiles mean nothing if it's not my project. "I just cloned a 4 year old repo and `npm install` is failing" is a ridiculous problem to have to deal with (which to repeat, is something that happens all the time whether people are willing to acknowledge it or not). This has to be addressed by making it part of the culture, which is where me telling you to commit your dependencies comes from.

This problem does happen, but committing node_modules won't fix it. Assuming the npm registry doesn't disappear, npm will download the exact same files you would have committed to your repo. Wherever those files came from, 4 years later you upgraded your OS, and now the install step will fail (in my experience usually because of node-gyp).

Unless you were talking about committing compiled binaries as well, in which case every contributor must be running the same arch/OS/C++ stdlib version/etc. M1 laptops didn't exist 4 years ago. If I'm using an M1 today, how can I stand up this 4 year old app?

The real problem is reproducible builds, and that's not something git can solve.

> Assuming the npm registry doesn't disappear, npm will download the exact same files you would have committed to your repo

I think that’s where you missed what the gp is saying.

If everything is checked in to source control, npm will have nothing to download. You won’t need to call npm install at all, and if you do it will just return immediately saying everything is good to go already.

The workflow for devs grabbing a 10 year old project is to check it out then npm start it.

You are probably not running the same OS/node/python version as you were 10 years ago. If you were to try this in real life, you'd get an error like this one. https://stackoverflow.com/questions/68479416/upgrading-node-....

The error helpfully suggests you:

>Run `npm rebuild node-sass` to download the binding for your current environment.

Download bindings? Now you're right back where you started. https://cdn140.picsart.com/300640394082201.jpg

Of course if you keep a 10 year old laptop in a closet running Ubuntu 10.04 and Node 0.4, and never update it through the years, then your suggestion will work. But that workflow isn't for me.

> which to repeat, is something that happens all the time

Are you sure this isn't just a problem in your organization? As I qualified, the issue you're describing was a real pain maybe two or three years ago, but not anymore IME. For context, my day job currently involves project migrations into a monorepo (we're talking several hundred packages here) and non-reproducibility due to missing lockfiles is just not an issue these days for me.

As the other commenter mentioned, node-gyp is the main culprit of non-reproducibility nowadays, and committing deps doesn’t really solve that, precisely because you often cannot commit arch-specific binaries, lest your CI blow up trying to run Mac binaries.

> Are you sure this isn't just a problem in your organization?

I'm really struggling to understand the kind of confusion that would be necessary in order for this question to make sense.

Why do you suspect that this might be a problem "in [my] organization"? How could it even be? When I do a random walk through projects on the weekend, and my sights land on one where `npm install` ends up failing because GitHub is returning 404 for a dependency, what does how things are done in my organization have to do with that?

I get the dreadful feeling that despite my saying "[That] means nothing if it's not my project", you're unable to understand the scope of the discussion. When people caution their loved ones about the risk of being the victim of a drunk driving accident on New Year's Eve, it doesn't suffice to say, "I won't drink and drive, so that means I won't be involved in a drunk driving accident." The way we interact with the whole rest of the world and the way it interacts with us is what's important. I'm not concerned about projects under my control failing.

> non-reproducibility due to missing lockfiles is just not an issue

Why do you think that's what we're talking about? That's not what we're talking about. (I didn't even say anything about lockfiles until you brought it up.) You're not seeing the problem, because you're insisting on trying to understand it through a peephole.

I mean, of course I'm going to see this through the lens of my personal experience (which is that nasty non-reproducibility issues usually would only happen when someone takes over some internal project that had been sitting in a closet for years and the original owner is no longer at the company). Stumbling upon reproducibility issues in 4 year old projects on GitHub is just not something that happens to me (and I have contributed to projects where, say, Travis CI had been broken in master branch for node 0.10 or whatever). Getting 404s on dependencies is something I can't say I've experienced, unless we're talking about very narrow cases like consuming hijacked versions of packages that have since been unpublished, or possibly a different stack that uses git commits for package management (say, C), and even then, that's not something I've run into (I've messed around w/ C, Zig and Go projects, if it matters). I don't think it's a matter of me having a narrow perspective, but maybe you could enlighten me.

As I mentioned, my experience involves seeing literally hundreds of packages, many of which were in a context where code rot is more likely to happen (because people typically don't maintain stuff after they leave a company and big tech attrition rate is high, and my company specifically had a huge NIH era). My negative OSS experience has mostly been that package owners abandon projects and don't even respond to github issues in the first place. I wouldn't be in a position to dictate that they should commit node_modules in that case.

Maybe you could give me an example of the types of projects you're talking about? I'm legitimately curious.

> At work, we've looked into committing tarballs (we're using yarn 3 now) but that also poses some challenges (our setup isn't quite able to deal w/ a large number of large blobs, and there are legal/auditing vs git performance trade-off concerns surrounding deletion of files from version control history)

With Git LFS the performance hit should be relatively minimal (if you delete the file it won't redownload it on each new clone anyways, stuff like this).

And what happens when you need to update those dependencies?

Software is a living beast, you can't keep it alive on 4yr-old dependencies. In fact, you've cursed it with unpatched bugs and security issues.

Yes, keep a separate repo, but also keep it updated. The best approach is to maintain a lag between your packages and upstream so issues like these are hopefully detected & corrected before you update.
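
The "lag" idea can be sketched as a simple selection rule: only accept versions that have been public for at least N days. This is a minimal illustration, assuming you already have a mapping from version to publish time (I believe the npm registry metadata exposes something like this, but treat that as an assumption). The version numbers, dates, and function name are invented.

```python
from datetime import datetime, timedelta, timezone


def newest_version_older_than(publish_times, min_age_days, now):
    """Pick the most recently published version that is at least min_age_days old.

    Returns None when no version is old enough.
    """
    cutoff = now - timedelta(days=min_age_days)
    aged = [(when, version) for version, when in publish_times.items() if when <= cutoff]
    if not aged:
        return None
    # max() on (publish time, version) tuples selects the latest qualifying release.
    return max(aged)[1]


UTC = timezone.utc
# Hypothetical publish history for one package.
history = {
    "1.0.0": datetime(2021, 1, 10, tzinfo=UTC),
    "1.1.0": datetime(2021, 9, 15, tzinfo=UTC),
    "1.1.1": datetime(2021, 10, 30, tzinfo=UTC),  # too fresh under a 14-day lag
}
today = datetime(2021, 11, 1, tzinfo=UTC)
print(newest_version_older_than(history, 14, today))
```

The selection is by publish date, not by semver order, so a backported patch to an old major would need extra handling.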

> And what happens when you need to update those dependencies?

Then you update them just like you do otherwise, like I already said is possible.

> you can't keep it alive on 4yr-old dependencies. In fact, you've cursed it with unpatched bugs and security issues

This is misdirection. No one is arguing for the bad thing you're trying to bring up.

Commit your dependencies.

Let's say that the day you update your dependencies is after this malware was injected but before it was noticed.

Now you have malware in your local repo :(

Having a local repo does not prevent malware. Your exposure to risk is less because you update your dependencies less frequently, but the risk still exists and needs to be managed. There's no silver bullet.

This is more misdirection. By no means am I arguing that if you're doing a thousand stupid things and then start checking in a copy of your dependencies, that you're magically good. _Yes_ you're still gonna need to sort yourself out re the 999 other dumb things.
Sounds like a great way to end up with what $lastco called "super builds" - massive repos with ridiculous amounts of cmake code to get the thing to compile somewhat reliably. It was a rite of passage for new hires to have them spend a week just trying to get it to compile.

All this does is concentrate what would be occasionally pruning and tending to dependencies to periodic massive deprecation "parties" when stuff flat out no longer works and deadlines are looming.

That’s the whole deal with yarn 2 isn’t it? With their plug’n’play it becomes feasible to actually vendor your npm deps, since instead of thousands upon thousands (upon thousands) of files you only check in a hundred or so zip files, which git handles much more gracefully.

I was skeptical at first as it all seemed like way too many hoops to jump through, but the more I think about it the more it feels that it’s worth it.

> Skip the nonsense and just check your dependencies in directly to your repo.

Haha, no.

That would increase the size of the repository greatly. Ideally, you would want a local proxy where the dependencies are downloaded and managed, or tarball the node_modules and save it in some artifacts manager, server, or S3 bucket.

What's the problem with a big repository? The files still need to be downloaded from somewhere. It's mostly just text anyway so no big blobs which is usually what causes git to choke.

For that one-off occasion when you are on 3G, have a new computer without an older clone, and need to edit files without compiling the project (which would have required npm install anyway), there is git partial clone.

Does npm have a shared cache if you have several projects using the same dependencies?

>Does npm have a shared cache if you have several projects using the same dependencies?

pnpm does, that's why I'm using it for everything. It's saving me many gigabytes of precious SSD space.

https://github.com/pnpm/pnpm
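
For anyone wondering how a shared cache saves disk space: as I understand it, pnpm keeps files in a content-addressable store and hard-links them into each project. The toy sketch below shows only that general idea and is not pnpm's actual layout; the function names and paths are made up, and it assumes a filesystem that supports hard links.

```python
import hashlib
import os
import tempfile


def add_to_store(store_dir: str, content: bytes) -> str:
    """Write content into the store under its SHA-256 hash, once."""
    path = os.path.join(store_dir, hashlib.sha256(content).hexdigest())
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(content)
    return path


def install_file(store_dir: str, project_dir: str, relative_path: str, content: bytes) -> str:
    """Hard-link the stored copy into a project instead of copying it."""
    stored = add_to_store(store_dir, content)
    target = os.path.join(project_dir, relative_path)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    os.link(stored, target)  # same bytes on disk, second directory entry
    return target


with tempfile.TemporaryDirectory() as root:
    store = os.path.join(root, "store")
    os.makedirs(store)
    source = b"module.exports = 42;\n"
    a = install_file(store, os.path.join(root, "app-a"), "node_modules/dep/index.js", source)
    b = install_file(store, os.path.join(root, "app-b"), "node_modules/dep/index.js", source)
    same_file = os.path.samefile(a, b)      # both projects point at one file
    store_entries = len(os.listdir(store))  # the store holds a single copy
    print(same_file, store_entries)
```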

Now anything with native bindings is broken if you so much as sneeze.
> Does anyone know of a good way to verify that a npm release matches what's on GH?

I'm not aware of any way to do this, and it's a huge problem. It would be great if they introduced a Docker Hub verified/automated builds[0]-type thing for open source projects. I think that would be the only way we could be certain what we're seeing on GitHub is what we're running.

Honestly it’s hard to believe we all just run unverifiable, untrustable code. At the very least NPM could require package signing, so we'd know the package came from the developer. But really NPM needs to build the package from GitHub source. Node is not a toy anymore, and hasn't been for some time. Or is it?

[0] https://docs.docker.com/docker-hub/builds/

This is ~solvable at a third party level. Nearly everything on NPM (the host) is MIT licensed or similar. When packages are published, run their publish lifecycle and compare to the package that’s actually published.

I don’t have the resources or bandwidth to do this, but it’s pretty straightforward +- weird publishing setups.

Edit: of course this doesn’t apply to private repositories but… you’re in a whole different world of trust at that point.
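
The comparison step of that idea is the easy part. Here is a rough sketch, assuming you have already unpacked the registry tarball into one directory and produced a rebuilt package from the tagged source in another (both of those steps are omitted, and they are where the "weird publishing setups" bite). The directory contents and function names are invented for the demo.

```python
import hashlib
import os
import tempfile


def hash_tree(root: str) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    hashes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            with open(full, "rb") as f:
                hashes[rel] = hashlib.sha256(f.read()).hexdigest()
    return hashes


def diff_trees(published: str, rebuilt: str) -> dict[str, list[str]]:
    """Report files that exist on only one side or whose contents differ."""
    pub, reb = hash_tree(published), hash_tree(rebuilt)
    return {
        "only_in_published": sorted(pub.keys() - reb.keys()),
        "only_in_rebuilt": sorted(reb.keys() - pub.keys()),
        "content_differs": sorted(p for p in pub.keys() & reb.keys() if pub[p] != reb[p]),
    }


def write(root: str, rel: str, text: str) -> None:
    """Demo helper: create a small text file under root."""
    path = os.path.join(root, rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(text)


with tempfile.TemporaryDirectory() as tmp:
    published, rebuilt = os.path.join(tmp, "published"), os.path.join(tmp, "rebuilt")
    # Invented contents: the published copy has an altered file and an extra script.
    write(rebuilt, "package.json", '{"name": "demo"}')
    write(rebuilt, "lib/index.js", "module.exports = 1;\n")
    write(published, "package.json", '{"name": "demo"}')
    write(published, "lib/index.js", "module.exports = 1; /* extra */\n")
    write(published, "install.js", "// not in the repo\n")
    report = diff_trees(published, rebuilt)
    print(report)
```

A byte-for-byte comparison like this will flag any non-deterministic build output (timestamps, minifier differences), so in practice you would need either reproducible builds or a human looking at the diff.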

I started working on this exact problem a few years ago. Didn't get far, though, I think I stopped because I assumed there just wouldn't be any real interest.
I couldn't find the code, so I just started over. Haven't hosted it anywhere yet.

https://github.com/connorjclark/npm-package-repro

Awesome! Thank you.
Doesn't npm have a facility to tell it to download releases directly from source? Most package managers have one in some form or another, but I'm not very familiar with npm.

To be honest I'm not sure if npm (the service, not the tool) and similar services really add all that much value. The only potential downside I see is that repos can disappear, but then again, npm packages can also disappear. I'd rather just fetch directly from the source.

This is how Go does it and I find it works quite well. It does have the GOPROXY now, but that's just an automatic cache managed by the Go team (not something where you can "login" or anything like that), so that already reduces the risk, and it's also quite easy to outright bypass by setting GOPROXY=direct.

Deno (https://deno.land/), another runtime based on v8, has a system similar to Go, with local and remote imports https://deno.land/manual@v1.11.5/examples/import_export.
You can’t really fetch from git because for the majority of packages there is a non-standard build step that packages do not consistently specify in package.json, if at all. Packages on NPM are just tarballs uploaded by the author. Furthermore, what about transitive dependencies?
> what about transitive dependencies?

What about them?

As for unspecified build steps: this seems like a solvable problem. I would just submit a patch.

Fetching from git is possible. The downsides are lack of semver, having to clone the full history of the repo, and having to clone the complete repo including files not needed for just using the lib, eg preprocessors, docs and tests.
I assume people use tags and such, no? That gives you versions and you can just fetch it at a specific tag. Either way, this is very much a solvable problem.

A few docs and tests don't strike me as much of an issue.

Renovate gives you links to diffs on the published npm package: https://app.renovatebot.com/package-diff?name=hookem&from=1....

It's also great at doing bulk updates, so you get a lot less spam than you do from Dependabot.

Technically you could pin directly to a git commit instead of an NPM release
Although npm supports lifecycle methods that run before publish / on install, many packages fail to use those correctly (or at all) yet still require a build step, so using the GH repo directly very often does not work.
This is the right answer. Everyone else replying is patently wrong.
Edit: this was intended for a child comment sorry