by mwcampbell 3776 days ago
A large number of dependencies is only a problem in environments that aren't amenable to per-function static linking or tree-shaking. These include dynamically typed languages like Python, Ruby, and JavaScript (except when using the Google Closure Compiler in advanced mode), but also platforms like the JVM and .NET when reflection is allowed. Where static linking or tree-shaking is feasible, the run-time impact of bringing in a large library but only using a small part of it is no more than the impact of rewriting the small part you actually use.

Edit: Dart is an interesting case. It has dynamic typing, but it's static enough that tree-shaking is feasible. Seth Ladd's blog post about why tree-shaking is necessary [1] makes the same point that I'm making here.

[1]: http://blog.sethladd.com/2013/01/minification-is-not-enough-...

9 comments

The run-time resource usage risks are only one element of deep dependency trees, and certainly the one I worry about least. The biggest risk is that you've handed every person in your dependency tree developer status in your project. That includes not just potential maliciousness, though that is a factor, but also the project potentially disappearing entirely, potentially going in an unexpected direction, potentially introducing bugs (oh that all projects merely monotonically got better), potentially introducing security vulnerabilities in deep parts of the stack you wouldn't even think to audit, diamond dependencies, difficult-to-replicate builds, etc. There are tools that can strike back at some of these, but there's something to be said for avoiding the problem.

For that matter, there's no guarantee that tree-shaking would even have fixed the referenced issue; if the library preloaded 10MB of stuff, like a Unicode definition table, that you didn't use, but the tree shaker couldn't quite prove you never would, you'll still end up with it loaded at runtime. (Indeed, you may very well be using such code even though you don't mean to, if, for instance, you have code that attempts to load the table, uses it for something trivial if it loads, and just keeps going without it if it is not present. The tree shaker will determine (correctly!) that you're using it.)

Basically, tree shaking only sort of kind of addresses one particular problem that deep dependencies can introduce, and that one not even necessarily reliably and well.

You're right; I over-stated the benefits of tree-shaking. My comment came out of long frustration that so many of the languages and platforms we use don't even support tree-shaking, so if we care about binary size, we have to do micro-management of dependencies that the machine should be able to do for us. But you're right; the other issues with dependencies remain.

About your 10 MB Unicode table example, did you have in mind the ICU data table that's included with Chromium and, more recently, all the Electron-based apps?

"About your 10 MB Unicode table example, "

The precise thing that I have personally experienced is Encode::HanExtra for Perl "getting around" in my code base and being able to save a couple of megabytes by loading it only when necessary. But the principle is the same, I'm sure. Full databases on all the characters in Unicode and such get quite large!

Not always. Dependencies were a huge problem at Google, even in C++ (perhaps especially in C++), because they mean that the linker has to do a lot of extra work. And unlike compiling, linking can't be parallelized, since it has to produce one binary. At one point the webserver for Google Search grew big enough that the linker started running out of RAM on build machines, and then we had a big problem.

There's still no substitute for good code hygiene and knowing exactly what you're using and what additional bloat you're bringing in when you add a library.

That's a pretty significant special case though. I'd be willing to go with the advice "If you get as big as Google's codebase, be sure to trim the dependencies on your statically-bound languages too." But you probably have a ways to go before that's an engineering concern for your project.

(... note that one could make a similar argument for more runtime-dynamic languages. I won't disagree, other than to observe that as a lone engineer, I've managed to code myself into a corner with dependencies in Rails ;) ).

^ THIS. I wish I had more up votes.

The amount of time I've seen wasted trying to scale to Google is insane. People should worry about what Google does once they work for at least a billion-dollar company.

For most projects, import as many dependencies as you can, as you are getting free labour. Sure, once in a while you'll fuck something up and waste a week or two, but it pales in comparison to the months you didn't spend reinventing the wheel.

No one really ever notices that it's all the companies with boatloads of cash that have massive technical debt. Even with the example at Google, the first thing I'd try is jamming more memory in those machines, and keep going until the linker needs more than 256 GB.

Fuck, Facebook still uses PHP, the stock market doesn't seem to care.

Was it running out of memory because of templates? For instance, parts of Boost, like Boost.Serialization, generate an obscene number of symbols due to the way they do metaprogramming.

Templates were a problem but not a huge one. They aren't used extensively in the webserver, and in any case they bloat compile time more so than link time.

The bigger problem was that we'd adopted a dependency strategy of "lots of little libraries" instead of "one big library with lots of source files". This offloads a lot of the work from the compiler to the linker. There are various advantages of this strategy - it speeds up incremental rebuilds, it encourages you to explicitly track all your dependencies, it simplifies IWYU, it's easier to parallelize - but linker RAM usage is not one of these advantages.

Wow. Was that with or without link-time optimization?
It was debug builds, which put an additional strain on the linker in that they keep around all the debug symbols. Regular builds were slow but manageable, so it's not like we couldn't release or develop, but it meant tracking down any sort of crash or serious bug became very difficult until we got the dependencies under control.

I forget the exact compiler settings - wasn't my department - but I think it included link-time optimization, and also FDO.

Ran out of RAM or address space? I've run out of RAM trying to do ridiculous stuff like native builds on tiny embedded systems (due to whack code bases that wouldn't cross compile). Though, in the end I overcame this with even more perverse solutions like adding swap space via USB1 flash.

That aside, that is the third time in a couple of months that I have heard people mention situations where they ran out of RAM without explaining why swap could not at least work as a stopgap measure.

Ran out of RAM for cloud builds - Google generally does not use virtual memory for anything in the cloud, because it leads to unpredictable, massive delays which can cause cascading failures in services.

It was still "possible" to build on your workstation, but the locality patterns in linking and subsequent thrashing made this an extremely small value of "possible". I recall once during this period I kicked off a local debug build on my workstation on Friday afternoon, went home for the weekend, and it was still running when I got into work on Monday morning. By Tuesday, I had given up and killed it.

Tree shaking is helpful but not enough. It makes dependencies more fine-grained and binaries smaller by removing some false sharing. But library maintainers still have to be careful about true sharing, where a function calls another function, which in turn pulls in something big (like a lot of data stored in a constant).
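
The false-sharing versus true-sharing distinction can be sketched as plain reachability over a reference graph. This is a toy model in Python, not how any particular linker or bundler works; the symbol names and sizes are invented for illustration.

```python
def reachable(references, entry_points):
    """Return every symbol reachable from the entry points."""
    seen = set()
    stack = list(entry_points)
    while stack:
        symbol = stack.pop()
        if symbol in seen:
            continue
        seen.add(symbol)
        stack.extend(references.get(symbol, ()))
    return seen


# Hypothetical library: size of each symbol in bytes.
SIZES = {
    "main": 200,
    "format_date": 300,
    "locale_lookup": 150,
    "LOCALE_TABLE": 5_000_000,
    "parse_xml": 40_000,
    "xml_entities": 8_000,
}

# Hypothetical reference graph: who calls or reads whom.
REFERENCES = {
    "main": ["format_date"],
    "format_date": ["locale_lookup"],
    "locale_lookup": ["LOCALE_TABLE"],
    "parse_xml": ["xml_entities"],
}

kept = reachable(REFERENCES, ["main"])
kept_size = sum(SIZES[symbol] for symbol in kept)
```

Here `parse_xml` and `xml_entities` are dropped (false sharing removed), but the 5 MB `LOCALE_TABLE` stays because `format_date` genuinely reaches it. Only the library maintainer can fix that second case.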

You need both tree shaking and a community dedicated to keeping code small.

JavaScript has the latter; it's not universally true, but lots of JavaScript libraries pride themselves on small code size and few dependencies.

That's great. But you can't stop doing that work just because you have a tree-shaking compiler. For example, there's a lot of work going into making Angular 2 apps reasonably sized and dart2js doesn't magically make it go away.

You're right; tree-shaking doesn't eliminate all instances of dependency bloat. Come to think of it, I've even seen a counter-example in C++. On Windows, a hello-world application using the wxWidgets GUI toolkit is ~2.4 MB, even statically linked. I think the problem, or at least part of it, is that the WindowProc implementation uses a big switch statement to handle all of the Win32 window messages that wx supports. So, for example, the handler for the WM_PRINTCLIENT message has to be included even if your application doesn't do any printing. Same for drag and drop. It would be better if the WindowProc implementation looked up message handlers in a table, and the application could ask wx to register the handlers for just the features that are actually used. I wouldn't be surprised if similar concerns apply to frameworks like Angular, even in Dart.
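
A rough sketch of the table-based alternative, written in Python purely to show the shape of the design. This is not the wxWidgets API; `Window`, `register`, `dispatch`, and `enable_painting` are hypothetical names.

```python
class Window:
    """Dispatches messages through a table instead of a fixed switch."""

    def __init__(self):
        self._handlers = {}

    def register(self, message, handler):
        self._handlers[message] = handler

    def dispatch(self, message, *args):
        handler = self._handlers.get(message)
        if handler is None:
            return None  # stands in for default message handling
        return handler(*args)


def on_paint(region):
    return f"painted {region}"


def enable_painting(window):
    # Only an application that calls this ever references on_paint.
    window.register("WM_PAINT", on_paint)


window = Window()
enable_painting(window)
```

Because `dispatch` never names any handler directly, a handler for a feature the application never enables (printing, drag and drop) is not referenced anywhere, so in a statically linked language the linker would be free to drop it.
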
Yes, UI frameworks are especially prone to this. Anywhere you have generic "display arbitrary thing on the screen" functionality, it logically depends on all the code you might need to fulfill that promise for arbitrary input.

Custom HTML tags typically can have arbitrary children, so the issue inevitably comes up. The reason a normal web page can be small is because the browser has already been downloaded.

I can't disagree with you, but you're also missing other issues related to having a big pile of dependencies. Maintainability being one. Runtime efficiency isn't the only problem that can be solved here.
Yeah, but it seems like you listed most of the platforms most people actually use, all in the X column. How do we get from 100MB Electron deployment and fat, partially used jars and dlls to this magical tree-shaking world?
You're right; most of the languages and platforms used for developing applications don't have reliable support for tree-shaking or fine-grained static linking. But there's hope for some of them.

When building Android applications, it's common to process the JVM bytecode with ProGuard before converting it to Dex bytecode. ProGuard includes a tree-shaking step. Sometimes it's necessary to tell ProGuard about specific classes or class members that it should leave alone, if they're accessed dynamically (e.g. using java.lang.reflect or Class.forName). But it's still better than nothing.

Likewise, .NET applications for the Windows Store are compiled to native code using .NET Native, and that compilation includes a tree-shaking step. This introduces some limitations on the use of reflection. I'm guessing similar limitations will apply to the native compilation option of .NET Core.

As for JavaScript, Google's Closure Compiler can do tree-shaking. But I don't know if the Closure Compiler's advanced mode works with any of the popular JavaScript libraries or frameworks, or just Google's Closure Library.

Good point about Android and .NET.

Regarding the Closure Compiler, ClojureScript touts whole program optimization as an advantage; the ClojureScript compiler just has to play by the Google Closure rules when emitting JS to take advantage. I'm sure it is harder for users of any arbitrary off-the-shelf library to gain the benefits.

It looks like Webpack 2 will have support for tree-shaking: https://github.com/webpack/webpack/pull/861#issuecomment-149...

There's also rollup.js (another module bundler) that supports tree-shaking: http://rollupjs.org/

Either way, it seems like the code needs to be using ES6 modules to make it all work.

JSPM already uses it for sfx builds

https://github.com/systemjs/builder/pull/205

ES6 modules allow devs to easily specify and import only the parts of a library that are being used. The bundler takes care of pulling in the necessary parts. No magic required.

The JVM isn't really penalized because the concurrency model tends to be threads, rather than processes, so you don't pay a per-worker cost for a large library. Plus the JVM will optimize only the actually used code, including de-virtualizing.
It helps somewhat, but I feel that the "only is a problem" assertion is too strong.

Tree shaking doesn't help you when you are pulling in every HTTP client in existence transitively. It is still code that is being run, so can't be automatically optimized away, but it is unnecessary.

The causal link between tree-shaking-compatible languages and accepting a large number of dependencies is not that clear. It could be that those who use a large number of dependencies just prefer less dynamic languages, where dependencies are very explicit and manageable.

Perhaps I wasn't clear. What I meant is that in environments that do support tree-shaking, you can depend on large libraries, and/or many small libraries, and the run-time impact will be no more than if you had written or copied and pasted just the functionality you need.

This could be very true, but to show causality, the tree-shaking abilities of languages and tools would, on average, have to precede the widespread use of huge dependency trees. It could be the reverse: for some unknown reason many dependencies appear, tree-shaking tools follow, and it is just easier to create them for static languages.
> It could be that those who use large number of dependencies just prefer less dynamic languages when dependencies are very explicit and manageable.

Tell that to anyone using NPM.

Clarification - my point was that a sane person prefers dependencies to be explicit and manageable. It is easier to create tools for that with static languages. As for NPM, due to the dynamic nature of JS, the tooling is rather hard to build and just not there yet.
Dependencies (and their specific versions) are already managed explicitly via package.json.

The shift to a flat dependency hierarchy in NPMv3 will make managing dependencies of dependencies much more explicit and straightforward to manage.

JSPM already uses a flat structure and shows how simple dependency management can be.

I'm not sure what you mean by "the dynamic nature of the tooling". The JS development ecosystem doesn't attempt to provide an end-all-be-all monolithic core lib. It's a good thing and one of the primary reasons why advances in the JS ecosystem are happening at breakneck pace.

PHP doesn't have tree-shaking, as it isn't compiled, but it does have autoloading: the source file for a class is only loaded from disk when the class is first used (for example, instantiated). This is probably similarly beneficial.