by michaniskin 4242 days ago
The main reason build tools exist is to perform transformations on a file set. The vast majority of the useful things build tools do are side effects.

These "side-effectful" functions are basically transducers; the reduction is performed on the file set instead of a sequence. They are mostly stateful transducers, but this does not fly in the face of functional programming. The opposite, in fact! They facilitate separation of concerns in a way that an immutable but global configuration map cannot.


It's true that you need to interact with the filesystem eventually, but the longer that can be deferred, the more complexity can be avoided.

Ideally I'd want to construct a functional data structure that describes my build process, and at the end pass it to a side-effectful function to produce a change to the filesystem. Boot appears to be side-effectful from the get-go, but perhaps I'm mistaken about how it operates?

Sorry, I should have mentioned that while boot, like any JVM build tool, does begin and end with the class path, we have spent an enormous amount of time experimenting with ways to mitigate the effects. We ended up with a system that provides many of the benefits of immutability while still living in the real world where files actually exist.

Here are some of the things boot provides:

1. We have "pods", which are separate Clojure runtimes in isolated class loaders in which you can evaluate expressions. The actual building occurs in these things. They are lexically scoped and can have a different class path than the main Clojure runtime where your build pipeline runs.

2. Files emitted during the course of the build are created in temp dirs managed by boot. There are a few different kinds of these temp dirs, one of which is lexically scoped. We also have temp dirs that are effectively immutable from a given task's point of view (we use a copy-on-write scheme to achieve this).

3. We make liberal use of hard links and directory syncing to emulate immutability wherever we can. Boot provides a kind of structural sharing with these hard links that really makes the pain of dealing with files go away.

4. We put a great deal of thought into how artifacts flow through the build pipeline, and how tasks that don't know anything about each other can cooperate to work on these files.

This is the most interesting part of boot for me, and I'll be making a complete writeup about it soon.
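
As an illustration of the hard-link idea in point 3, here is a minimal sketch in Python. It is not boot's implementation; the function names `snapshot` and `write_cow` and the file names are hypothetical. It mirrors a directory using hard links (so unchanged files share storage) and replaces a file by writing a fresh inode and renaming it into place, which leaves the earlier snapshot untouched. It assumes a file system that supports hard links.

```python
import os
import tempfile


def snapshot(src_dir: str, dst_dir: str) -> None:
    """Mirror src_dir into dst_dir using hard links instead of copies."""
    for root, _dirs, files in os.walk(src_dir):
        rel = os.path.relpath(root, src_dir)
        target_root = os.path.normpath(os.path.join(dst_dir, rel))
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            os.link(os.path.join(root, name), os.path.join(target_root, name))


def write_cow(path: str, data: str) -> None:
    """Copy-on-write style update: write a new file, then rename it over path.

    Other hard links to the old content keep pointing at the old inode.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(data)
    os.replace(tmp, path)


def read_text(path: str) -> str:
    with open(path) as f:
        return f.read()


def demo() -> dict:
    base = tempfile.mkdtemp()
    before = os.path.join(base, "before")
    after = os.path.join(base, "after")
    os.makedirs(before)
    with open(os.path.join(before, "a.txt"), "w") as f:
        f.write("v1")
    with open(os.path.join(before, "b.txt"), "w") as f:
        f.write("unchanged")

    snapshot(before, after)                         # cheap: links only
    write_cow(os.path.join(after, "a.txt"), "v2")   # diverge one file

    return {
        "before_a": read_text(os.path.join(before, "a.txt")),
        "after_a": read_text(os.path.join(after, "a.txt")),
        "b_shared": os.path.samefile(
            os.path.join(before, "b.txt"), os.path.join(after, "b.txt")
        ),
        "a_shared": os.path.samefile(
            os.path.join(before, "a.txt"), os.path.join(after, "a.txt")
        ),
    }
```

After `demo()`, the "before" directory still holds the old `a.txt`, while `b.txt` is one file visible from both directories. That is the sense in which hard links give a kind of structural sharing.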

It sounds like Boot has had a lot of thought put into it, and I certainly welcome the idea of effectively immutable directories.

But it also looks like Boot dives head first into I/O, when I'd prefer a build tool that is a little more circumspect about complexity. While I welcome competition to Leiningen, and I'll certainly keep an eye on Boot, my initial impression is that it's heading in the opposite direction to where I'd want a build tool to go.

I'm of the mind that build tools aren't going to get better unless we forcibly insert instrumentation between our build tasks and the stateful resources upon which they depend. To that end, I'd like to hear more about both your Clojure runtime isolation and filesystem isolation mechanisms.
All JVM build tools are side-effectful from the get-go, and they really have to be. Consider the :dependencies and :source-paths keys in a Leiningen project.clj. The purpose of these is to manipulate the mutable class path. A JVM build tool that doesn't revolve around the class path would require a complete reinvention of the JVM ecosystem and all of the existing tooling (for example, our demo uses the Google Closure compiler, which mutates all kinds of things; that would have to go), which is, I'm sure, never going to happen.
You need to eventually be side-effectful, but that doesn't mean you need to start side-effectful. The :dependencies in a Leiningen project map are just a data structure until they're passed to eval-in-project, which happens at the end of a chain of functional operations.

One of the core ideas of Clojure is that we should try to favour simple solutions over complex ones. Side-effectful functions are some of the most complex tools we have, and while they are necessary eventually, it would be nice to have the majority of the code-base work with simple data structures, and push the complexity of I/O out to the edges of the application.

Side-effects aren't necessarily bad; it's composability that is good. Purity is a composable property, but there are plenty of composable side-effects.

For example, let's ignore all other effects besides file IO (especially ignoring internet IO). Further, let's assume that our only IO operations are `(read path) => data` and `(write path data) => nil`. Let's also assume that both of these operations are atomic (i.e. you can't perceive a half-written file). If a build task attempts to read a file that doesn't exist, that task pauses itself. When a file is written, any tasks waiting on the file are resumed. If you re-write a file, it has to exactly match the already written file, or the build fails. To kick off a build, you wait on one or more files and then start one or more tasks.
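
The model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file system is replaced by an in-memory dict, tasks are generators that yield `("read", path)` or `("write", path, data)` requests, and every name here (`WriteOnceStore`, `run`, `BuildFailed`, the example tasks) is hypothetical.

```python
from collections import defaultdict, deque


class BuildFailed(Exception):
    """Raised when a path is rewritten with different content."""


class WriteOnceStore:
    """In-memory stand-in for the file system.

    A path may be written once; rewriting is allowed only with identical data.
    """

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        if path in self.files and self.files[path] != data:
            raise BuildFailed(f"conflicting rewrite of {path}")
        self.files[path] = data


def run(tasks, store=None):
    """Run generator-based tasks until none can make progress.

    A read of a missing path parks the task until another task writes it.
    """
    if store is None:
        store = WriteOnceStore()
    ready = deque((task, None) for task in tasks)
    waiting = defaultdict(list)  # path -> tasks paused on that path
    while ready:
        task, value = ready.popleft()
        try:
            op = task.send(value)
        except StopIteration:
            continue  # task finished
        if op[0] == "read":
            path = op[1]
            if path in store.files:
                ready.append((task, store.files[path]))
            else:
                waiting[path].append(task)
        else:  # ("write", path, data)
            _, path, data = op
            store.write(path, data)
            ready.append((task, None))
            for paused in waiting.pop(path, []):
                ready.append((paused, data))  # resume with the file content
    return store.files


def source_task():
    yield ("write", "main.c", "int main(){}")


def compile_task():
    src = yield ("read", "main.c")
    yield ("write", "main.o", f"obj({src})")


def link_task():
    obj = yield ("read", "main.o")
    yield ("write", "app", f"bin({obj})")
```

Starting the three tasks in any order yields the same final set of files, which is the repeatability property described here: a task can only ever observe a file after it has reached its final content.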

Viewed this way, the file-system is a monotonic logic variable. Yes, the programming model is effectful, but there is a composable property: build repeatability. Just as you can compose arbitrary pure functions and get a pure function out, you can take any two arbitrary graphs of these constrained IO build tasks, compose them together, and the resulting larger graph will also be a repeatable build.

There's certainly a lot you can do to constrain I/O, but removing side-effects when possible will always be better than merely restricting them.

I do like the idea of immutable files and repeatable builds, but I don't think this negates the benefits of maximising the time you spend working with data.

For instance, there might be a task that automatically cleans up some files, but you want to keep those files around. If the tasks produce data structures that describe the build process, it's much easier to intercept and prevent or reorder the cleanup process. Functions, especially ones that are Turing-complete, are notoriously opaque.
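
A minimal sketch of that argument, in Python, assuming a made-up plan format (a list of dicts with hypothetical `op`, `path`, and `data` keys). The plan is plain data, so dropping the cleanup step is an ordinary list transformation, and only `execute` touches the disk.

```python
import os
import tempfile


def make_plan():
    """A hypothetical build plan: plain data, nothing has touched the disk."""
    return [
        {"op": "write", "path": "out/app.js", "data": "console.log('hi')"},
        {"op": "write", "path": "out/app.js.map", "data": "{}"},
        {"op": "delete", "path": "out/app.js.map"},  # the cleanup task
    ]


def without_cleanup(plan, keep):
    """Pure transformation: drop delete steps for paths we want to keep."""
    return [
        step
        for step in plan
        if not (step["op"] == "delete" and step["path"] in keep)
    ]


def execute(plan, root):
    """The only side-effectful function: apply the plan under root."""
    for step in plan:
        target = os.path.join(root, step["path"])
        if step["op"] == "write":
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "w") as f:
                f.write(step["data"])
        elif step["op"] == "delete":
            os.remove(target)
        else:
            raise ValueError(f"unknown op: {step['op']}")


def demo():
    kept_root = tempfile.mkdtemp()
    plain_root = tempfile.mkdtemp()
    execute(without_cleanup(make_plan(), {"out/app.js.map"}), kept_root)
    execute(make_plan(), plain_root)
    return (
        os.path.exists(os.path.join(kept_root, "out/app.js.map")),
        os.path.exists(os.path.join(plain_root, "out/app.js.map")),
    )
```

If the cleanup were instead buried inside an opaque task function, there would be no step to filter out.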

> removing side-effects when possible will always be better than merely restricting them

I disagree. Effects are a very natural mental model for a great many problems, and constraining yourself to purity is both impractical and quickly hits diminishing returns.

Furthermore, if you can intercept effects, you can impose purity upon them. For an extreme example, consider application virtualization and containers such as Docker. By intercepting the system call table, you can create a "pure" filesystem from the view outside the container. At the other extreme, take a look at "extensible effects" and the Eff language, which lets you stub any subset of the effects available down to the individual expression!

> If the tasks produce data structures that describe the build process, it's much easier to intercept and prevent or reorder the cleanup process.

If you intercept all file IO, you can recover the same data. The only difference is whether or not you know that data upfront.

> Functions, especially ones that are Turing-complete, are notoriously opaque.

This is true! However, there are a great many build processes that do not know what they depend on or what they will produce until they do some Turing-equivalent work. For example, scanning a C header to find #include statements.

Rather than try to shoehorn all data into a declarative model, we need both 1) fully declarative descriptions and 2) the ability to recover a declaration from the trace of an imperative process.

An example of this trick, employed manually, is the notorious .d Makefiles. The C compiler finds all the dependencies and produces a submake file with the .d extension, then make restarts recursively using the new .d file as part of the dependency graph. However, it's a very unnatural way to think about the problem and it leads to complex multi-pass build processes that are necessarily slower. Instead, the dependency graph could be produced as a side-effect of simply doing the compilation and that graph could be used as part of a higher-level declarative framework.
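
A rough Python sketch of recovering the dependency graph as a by-product of a single pass. It is a toy under stated assumptions: the "compiler" only follows quoted `#include "..."` lines, ignores `<...>` includes, include paths, and conditional compilation, and all reads go through an injected `read_file` function (a hypothetical interception point) so that each one is recorded.

```python
import re

# Matches lines such as:  #include "foo.h"
INCLUDE = re.compile(r'^\s*#\s*include\s+"([^"]+)"', re.MULTILINE)


def compile_with_trace(entry, read_file):
    """Pretend-compile entry and return the dependency graph it discovered.

    read_file maps a path to its text. The graph maps each visited file to
    the list of files it includes directly.
    """
    deps = {}

    def visit(path):
        if path in deps:
            return  # already processed
        text = read_file(path)
        includes = INCLUDE.findall(text)
        deps[path] = includes
        for included in includes:
            visit(included)

    visit(entry)
    return deps


def demo():
    sources = {
        "main.c": '#include "a.h"\n#include "b.h"\nint main(){}\n',
        "a.h": '#include "b.h"\n',
        "b.h": "// leaf header\n",
    }
    return compile_with_trace("main.c", sources.__getitem__)
```

The resulting graph is ordinary data that a declarative layer could consume on the next build, without a separate dependency-scanning pass.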

It's not at all clear to me that you are correct.

I'm glad that people are putting thought into the cljs build process, though; to this day it is still a particularly un-fun part of ClojureScript.