Hacker News new | ask | show | jobs
by dgreensp 1544 days ago
These problems have lots of other well-known solutions.

For large objects: Persistent data structures with path-copying. Batching small mutations (and using mutation internally in a way that doesn’t leak out). Using the type system somehow to distinguish cases where copying is not necessary (experimental in certain new languages). Structuring your program around a “data store” that is mutable.

For the “references” problem: Explicit references, like by ID. Or mutable “atoms” as in Clojure, which are maybe sort of like the “mediators” in the article.

4 comments

Persistent data structures are very hard. It's basically impossible to cleanly combine them. E.g. if you have an immutable hashtable that contains immutable lists, and you want to modify a list - how do you know if the inner list changed or not (you don't want to be creating new hashtables if the elements don't change).

They work in the small, but don't truly scale.

> you want to modify a list - how do you know if the inner list changed or not

You know for sure that inner list did _not_ change, because it is immutable.

Someone else might have a reference to an updated hashtable with an updated inner list, but the one you hold will never change—isn’t that the whole idea of immutable data structures?

No, I mean "update" in the persistent sense - create a new list, with some (or none) elements changed/added/removed.

e.g. say you have a method remove_all_in_values that takes an int x and a hashmap of list of int, and returns a new hashmap of list of int, with x removed from each element of the hashmap.

Obviously you'd use something like List.filter for the inner operation, but most functional programming languages give you no indication whether the result of filter is the same list (i.e. element is not present) or a new list (i.e. some elements were removed).

So how do you know if you create a new hashtable or return the old one? One solution is complex, the other is wasteful.

No language will tell you at compile time whether an arbitrary list filter operation will have an effect or not. If it can, it's because the list filter isn't arbitrary, it's probably always a no-op.

The generalisation is that you have some type parent with a way to select a type child, plus a function that modifies a child, you want to lift your child -> child function to a parent -> parent function.

Optics help a lot with this class of problem.

Because you're in the world of persistent data structures, and you're doing a modify operation, there is no avoiding the 'modify hashtable' part of 'modify an element within a hashtable'. You always create a 'new' hashtable.

If it's common that your update operation has no effect it may be worth switching to a function like child -> Modified child for Modified a = NoOp | Mutated a, so the hashtable modify function can determine if it can skip its own update. A list filter is a trivial thing to write, and one that can return NoOp if no elements are filtered is no less efficient.

I see what you mean now.

I think most persistent data structures would (for most common operations) return exactly themselves in case when no modification happened. So you can very cheaply compare for equality, and that would either resolve to simple one-op comparison of references (to see if they are equal or not), or, in the worst case, comparison of some minor amount of diverging sub-elements (since persistent data structures do extensive structural sharing by definition).

In this sense, linked list is a “bad” persistent collection for your use case, but a persistent set instead would work wonderfully — if nothing changed, you’d get the same reference to a collection as input.

I am pretty sure it either would work this way, or is very easy to optimize for this use case, in Clojure.

Isn't the idea of persistent data structures, that they may be more expensive than their imperative versions, but try to reduce the expense to a minimum? So by using a persistent data structure, one should already have taken that potential additional expense into consideration. The wasteful solution should not be that much more wasteful, if we assume, that there is an OK-performing persistent data structure for the use-case. Otherwise I guess one should not choose a persistent data structure. But then you run into problems, when you want to make use of multiple cores to throw at the problem.
In FP I think composability and interchangeability of these higher order functions help, in case the supplied filter did not work like you want: You could write your own filter function variant that guarantees to return the same list if the predicate function returned true for each element.
You can pretty efficiently do this in Clojure

    (update-in
     {:foo [{:bar ["a"], :baz ["b" "c" "d"]}]}
     [:foo 0 :baz 1]
     clojure.string/capitalize)
    
    
    ;; => {:foo [{:bar ["a"], :baz ["b" "C" "d"]}]}
This only works if you know what to edit up-front.

If you're processing a tree-like structure via recursion (e.g. type-checking an AST implemented with persistent immutable structures in a compiler) it's no longer easy.

> how do you know if the inner list changed or not

This is a vital question that virtually every programming language overlooks.

If you are using a mainstream language, you do not know if they were modified.

If you use a language which has a type system to segregate mutable operations from immutable operations, then you know if they were modified.

> They work in the small, but don't truly scale.

Go even bigger then. Work with the kind of data that won't fit in memory. Put your data on disk - you can't insert into the middle of a 1GB+ file.

I can't stand the standard update-based interface of persistent data structures either. https://lib.rs/crates/im wraps persistent structures in a standard vector interface, with instant clones (only copying a pointer to refcounted data) but slower assignments (which may copy data internally if other references exist), which is effectively equivalent to CoW. A persistent update is instead written as a clone followed by in-place assignment, and you can mutate in-place (which copies if there are other references to the data being modified). The nice part is that only IndexMut detaches the underlying data from other references.

I haven't used this crate yet, but if I have a need for persistent data structures I'd definitely try this since it's more accessible.

I think a lot of the pain in Scala/Rust etc immutable data structures comes down to static typing. In Clojure for instance you can very ergonomically do deeply nested persistent updates.

    (update-in
     {:foo [{:bar ["a"] :baz ["b" "c" "d"]}]}
     [:foo 0 :baz 1]
     clojure.string/capitalize)

    => {:foo [{:bar ["a"], :baz ["b" "C" "d"]}]}
In Haskell you can you Lenses to easily manipulate sub-trees, individual values, or a selection of values in tree like data structures. It is very powerful.
In Clojure the default associative data structure is a persistent map and cloning it with 0 changed nodes is basically a no-op.
Persistent data structures exist and work very well but they still remain way slower than alternatives that don't support persistence. This is especially bad if you want the data structure to be shared across multiple threads. Just iterating over a tree now requires taking and releasing oodles of locks. Add in the GC overhead and you've got a meaningful problem.

Yes, you aren't constructing an entirely new 30k byte array on each mutation. But you are still adding a lot of overhead to your program.

> Persistent data structures exist and work very well but they still remain way slower than alternatives that don't support persistence.

Depending on your definition of "way" this is true. A lot of persistent vector-like structures can have quite good performance, but if your runtime is dominated by iteration and random updating, you will still get a speed up by using raw arrays, i.e. guaranteed contiguous blocks of memory.

> This is especially bad if you want the data structure to be shared across multiple threads. Just iterating over a tree now requires taking and releasing oodles of locks.

This is not. The whole point of immutable data structures and concurrency is that iterating over them does not require any locks because the structure won't change! So concurrent reads require absolutely no new code and certainly no locks. For concurrent writes, the usual strategy is a single CAS-like operation at the top-level to swap out the old immutable structure for a new one (e.g. Clojure atoms, Java AtomicReferences, Haskell IORefs, etc.). But that's a single thing for the whole structure, not every level.

In general concurrent update in the functional programming world is handled by CAS not locks.

CAS still has a lock in actual memory.
Not necessarily. Depends on the underlying hardware implementation (with different answers for e.g. x86 vs ARM) since software CAS usually is implemented as a single CPU instruction.

This is also stretching the definition of "lock" quite a bit because usually CAS is used to implement lock-free and wait-free concurrency.

But regardless at most, even in hardware, it's a single top-level lock, and sometimes not even that. Not one along every level of a data structure.

> This is especially bad if you want the data structure to be shared across multiple threads. Just iterating over a tree now requires taking and releasing oodles of locks.

Um, what?

Persistent data structures don't require locks at all (because they're immutable), in fact using them in a multithreaded way is 100% safe and thus much easier than their mutable (and ordinarily more efficient) counterparts.

Refcounted gc means you need locking on refcounts or you can lose your tree during iteration.
Can you provide a list of reference-counted functional languages?
I really enjoyed this presentation on how roc-lang is being designed to handle optimization using opportunistic in-place mutation: https://yewtu.be/watch?v=vzfy4EKwG_Y&t=1328.
I was thinking along the same lines. Is an "atom" in Clojure like STM where the structure/reference is changed in a highly controlled manner?
atoms are containers that require no coordination so its a simple Compare and swap. There are Refs that require transactions and change of multiple Refs is coordinated in these transactions. WIth that Clojure has an implementation of Software Transaction Memory.