Hacker News new | ask | show | jobs
by vlovich123 1221 days ago
> if we have a map and want to do some operation on a slightly modified map, we can have a value that keeps the old map but also works with the new map (without much performance cost).

What black magic is this? Is the article just glossing over the cost of a copy or does Haskell do something weird here to avoid the copy while retaining both versions?

7 comments

The second. Because everything in Haskell is immutable, it can take a _lot_ of liberties in terms of letting structures share parts of one another, in the case where one version is derived from the other.

Conversely, in the case where something like a list is modified in entirety (e.g. with a `map` function), if the compiler can determine that the original is no longer needed, it can run the map operation in place - much like you might do on an array in C - avoiding the need for a second copy of the structure in-memory.

The immutability guarantees let you build some really powerful constructs. Check out the world of immutable/functional data structures: https://en.wikipedia.org/wiki/Purely_functional_data_structu...
> In computing, a persistent data structure or not ephemeral data structure is a data structure that always preserves the previous version of itself when it is modified. Such data structures are effectively immutable, as their operations do not (visibly) update the structure in-place, but instead always yield a new updated structure. The term was introduced in Driscoll, Sarnak, Sleator, and Tarjans' 1986 article. (Making Data Structures Persistent - https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.133...)

From https://en.wikipedia.org/wiki/Persistent_data_structure

Clojure leverages the same data structure for (at least) four basic types; list, map, vector and set, and the following article explains it well with a pretty graph/picture too: https://practical.li/clojurescript/clojure-syntax/persistent...

I don't know about Haskell particularly but the underlying implementation of data structures in pure languages is sometimes more complicated to enable this sort of thing. For example, the new updated copy of the map may refer to the "old" map for most of its data, thereby making it cheap to have both.
So copy-on-write basically?
Yes, but with granularity at the "node" level instead of being across the whole container. Imagine a linked list.

If you add prepend an element to an existing list, no copy is necessary. It's just a new head with the tail being the original list. That's an easy case.

If you add a node in the middle of the list you can still share the unchanged tail between the two slightly different prefix sequences.

Copy-on-write generally refers to copying everything on every write; this is a persistent data structure, which can do updates while only copying a small part of itself. Same purpose as COW, but more efficient for larger data structures we want to make lots of small updates on.
Well, yeah, only there ain't any writes (and therefore, no copying).

The simplest example would be with linked lists, I suppose: you have a list A->B->C, then you take the B->C sublist and prepend X to it, so you now have X->B->C, but A->B->C is still around, and "B->C" part is shared between those.

No, because you don't copy the whole structure. You create a new partial structure.

E.g. let's say you create a map with keys A and B. You then insert a new key C. In memory you might have two objects: One is the origi al map with A and B and the other has "Key C and a ref to the first map".

No, there is no "write" because the old value never chnages. It's "create an overlay on write"
I can't speak for Haskell as I haven't looked at their implementation, avoiding copies is pretty typical of immutable data structures. Because each component of the data structure is immutable, you can safely point to any part of the structure that you don't need to change rather than copy it.

For example, in a basic binary search tree implementation of a map (using C-ish syntax for those who don't know a functional language):

    struct Node {
      String key;
      int value;
      Node left;
      Node right;
    }

    Node set(Node n, String key, int value) {
      if(n == null) {
        return new Node { key = key, value = value, left = null, right = null };
      }

      if(key < n.key) {
        return new Node {
          key = n.key,
          value = n.value,
          left = set(n.left, key, value),
          right = n.right
        };
      }

      if(key > n.key) {
        return new Node {
          key = n.key,
          value = n.value,
          left = n.left,
          right =  set(n.right, key, value)
        };
      }

      return new Node { key = key, value = value, left = n.left, right = n.right };
    }
A perfectly-balanced tree with depth of 5 has 1 + 2 + 4 + 8 + 16 = 31 nodes. If you call the above function, on such a tree, the worst case scenario is that the key doesn't exist, so it modifies 5 nodes during its search and creates a new sixth node. 26 of the original 31 nodes are reused and referenced by the newly-created map. The percentage of nodes reused only improves as the perfectly-balanced tree gets larger.

Of course, if this is your implementation of set(), the tree won't be perfectly-balanced, so a production implementation of a tree-based map needs tree-rebalancing (as well as memory-reordering and compacting for cache locality). These extra constraints typically mean less of the tree can be re-used, but the percentage of nodes which can be reused remains high.

The wording is terrible, but the simple common Haskell way is to use overlays/diffs.

You have an object. When you want to apply a patch, you create a new object that contains just your patch, plus a reference to the old obejct. DiffArray is the simple common example. It's fast enough when the diffs are small, but terrible when there are many diffs in series, creating a deep stack of references.

It's not obscure. $PATH and the /bin,/usr/bin, /usr/local/bin, $HOME/bin dirs on Linux work the same way.

A compiler can do a huge amount of black magic when the order of evaluations is not guaranteed.

And GHC exploits that liberty.