Hacker News new | ask | show | jobs
by steven-xu 1535 days ago
Summarizing the 3 major JS package management approaches:

* Classic node_modules: Dependencies of dependencies that can't be satisfied by a shared hoisted version are nested as true copies (OSes may apply copy-on-write semantics on top of this, but from FS perspective, these are real files). Uses standard Node.js node_modules resolution [1].

* pnpm: ~1 real copy of each dependency version, and packages use symlinks in node_modules to point to their deps. Also uses standard resolution. Requires some compatibility work for packages that wrongly refer to transitive dependencies or to peer dependencies.

* pnp[2]: 1 real copy of each dependency version, but it's a zip file with Node.js and related ecosystem packages patched to read from zips and traverse dependency edges using a sort of import map. In addition to the compatibility work required from pnpm, this further requires compatibility work around the zip "filesystem" indirection.

In our real-world codebase, where we've done a modest but not exhaustive amount of package deduplication, pnpm confers around a 30% disk utilization savings, and pnp around a 80% savings.

Interestingly, the innovations on top of classic node_modules are so compelling that the package managers that originally implemented pnpm and pnp (pnpm and Yarn, respectively) have implemented each others' linking strategies as optional configs [3][4]. If MacOS had better FUSE ergonomics, I'd be counting down the days for another linking strategy based on that too.

[1] - https://nodejs.org/api/modules.html#loading-from-node_module... [2] - https://yarnpkg.com/features/pnp [3] - https://github.com/pnpm/pnpm/issues/2902 [4] - https://github.com/yarnpkg/berry/pull/3338

3 comments

> 1 real copy of each dependency version...

npm showed me that I lack creativity, for I could not imagine anything worse than maven.

The ~/organization/project/release dir structure is the ONE detail maven got right. (This is the norm, the Obviously Correct Answer[tm], right?)

And npm just did whatever. Duplicate copies of dependencies. Because reasons.

Node is doing the right thing: if two dependencies in maven have conflicting dependencies, maven just picks an arbitrary one as _the_ version, which results in running with an untested version of your dependency (the dependency is actually depending on a version the developers of that dependency didn’t specify). Because node allows the same dependency to be included multiple times, npm and friends can make sure that every dependency has the right version of its dependencies.
> Node is doing the right thing

Node does a different thing. It can coalesce two different versions into one if the two things are within a certain semver range, but there's nothing that enforces whether things within a semver range are actually compatible. The most prominent example is Typescript, which famously does not follow semver. Another notable example of how NPM itself does things wrong is that it considers anything in the `^0.x` range as compatible, whereas semver distinctly says the 0.x range is "anything goes".

Incompatible libs, you say? Try this one on: once upon a time a handful of years ago a package-lock.json I worked on drifted so far from package.json that you could not remove package-lock.json and rebuild purely from package.json. The versions specified in the package.json were incompatible with each other, but the package-lock.json had somehow locked itself to a certain permutation of versions that it somehow just worked.

I always shudder to think that different versions of packages live in node_modules and one library produces an object that somehow makes it to the other version of the library and... I'd rather not think of all these implications or I would go crazy.

I agree another the 0.x thing. The rest is basically a result of people refusing to use the versioning system the way it’s designed to be used, which is a problem with a package not with the specified behavior of npm here: violating the rules of semver is UB
I would definitely put part of the blame on the design of the system. It allows anyone to write stuff like `"lodash": "*"`, which is a perfectly valid range as far as semver goes. And then there's things like yarn resolutions, where a consumer can completely disregard what a library specifies as its dependencies and override that version with whatever version they want. And there's aliases (`"lodash": "npm:anotherpackage@whatever"`) and github shorthands and all sorts of other wonky obscure features. And we haven't even touched on supply chain vulns...
>> maven just picks an arbitrary one as _the_ version

No that’s never been the case. If you have conflicting versions of a dependency in your dependency graph, maven chooses the “nearest neighbour” version - it selects the version specified least far away from your project in the transitive dependencies graph.

Pinning a particular choice is easy too - you just declare the dependency and specify the version you want instead of relying on transitive deps.

This is what I mean by an arbitrary version: it’s not determined by the dependency but by some characteristic of the dependency tree. And, this is only necessary because the JVM can’t load two versions of the same dependency (ignoring tricks like the maven-shade-plugin)
The JVM can load the same class any number of times through different class loaders -> only the (class, classloader) tuple has to be unique.

I guess the reason they didn’t went the duplicative direction is that java has safe class loading semantics at runtime and due to valuing storage/memory capacity (which was frankly a sane choice, like 10x bigger java projects compile faster than a js project that pretty much just copies shit from one place to another?)

That’s kind of incredible that yarn pnp out performs pnpm. If that’s generally true across most projects then I’m really glad that turborepo decided to use it for project subslicing.
The practical disk usage difference between pnp and pnpm appears to be almost entirely accounted for by the fact that pnp zips almost all packages. Both store ~1 version of each package-version on disk; it's just that one's zipped and one's not. The mapping entries for package edges in both cases (.pnp.cjs for pnp and the symlink farms for pnpm) are very small in comparison.
Disk utilization is only one metric. The trade-off for Yarn PNP is that it incurs runtime startup cost. For us (~1000+ package monorepo), that can be a few seconds, which can be a problem if you're doing things like CLI tools.

Also, realistically, with a large enough repo, you will need to unplug things that mess w/ file watching or node-gyp/c++ packages, so there is some amount of duplication and fiddling required.

Problames long solved before, but problems that don't matter to the javascript crowd.. I think they actually love that things take so long. It makes them thing they're doing important work.. "We're compiling and initting"
If your file system supports compression, (e.g. zfs and btrfs) then the actual disk usage of pnp and pnpm should be similar ?