Hacker News new | ask | show | jobs
by tomp 1545 days ago
Persistent data structures are very hard. It's basically impossible to cleanly combine them. E.g. if you have an immutable hashtable that contains immutable lists, and you want to modify a list - how do you know if the inner list changed or not (you don't want to be creating new hashtables if the elements don't change).

They work in the small, but don't truly scale.

4 comments

> you want to modify a list - how do you know if the inner list changed or not

You know for sure that inner list did _not_ change, because it is immutable.

Someone else might have a reference to an updated hashtable with an updated inner list, but the one you hold will never change—isn’t that the whole idea of immutable data structures?

No, I mean "update" in the persistent sense - create a new list, with some (or none) elements changed/added/removed.

e.g. say you have a method remove_all_in_values that takes an int x and a hashmap of list of int, and returns a new hashmap of list of int, with x removed from each element of the hashmap.

Obviously you'd use something like List.filter for the inner operation, but most functional programming languages give you no indication whether the result of filter is the same list (i.e. element is not present) or a new list (i.e. some elements were removed).

So how do you know if you create a new hashtable or return the old one? One solution is complex, the other is wasteful.

No language will tell you at compile time whether an arbitrary list filter operation will have an effect or not. If it can, it's because the list filter isn't arbitrary, it's probably always a no-op.

The generalisation is that you have some type parent with a way to select a type child, plus a function that modifies a child, you want to lift your child -> child function to a parent -> parent function.

Optics help a lot with this class of problem.

Because you're in the world of persistent data structures, and you're doing a modify operation, there is no avoiding the 'modify hashtable' part of 'modify an element within a hashtable'. You always create a 'new' hashtable.

If it's common that your update operation has no effect it may be worth switching to a function like child -> Modified child for Modified a = NoOp | Mutated a, so the hashtable modify function can determine if it can skip its own update. A list filter is a trivial thing to write, and one that can return NoOp if no elements are filtered is no less efficient.

I see what you mean now.

I think most persistent data structures would (for most common operations) return exactly themselves in case when no modification happened. So you can very cheaply compare for equality, and that would either resolve to simple one-op comparison of references (to see if they are equal or not), or, in the worst case, comparison of some minor amount of diverging sub-elements (since persistent data structures do extensive structural sharing by definition).

In this sense, linked list is a “bad” persistent collection for your use case, but a persistent set instead would work wonderfully — if nothing changed, you’d get the same reference to a collection as input.

I am pretty sure it either would work this way, or is very easy to optimize for this use case, in Clojure.

Isn't the idea of persistent data structures, that they may be more expensive than their imperative versions, but try to reduce the expense to a minimum? So by using a persistent data structure, one should already have taken that potential additional expense into consideration. The wasteful solution should not be that much more wasteful, if we assume, that there is an OK-performing persistent data structure for the use-case. Otherwise I guess one should not choose a persistent data structure. But then you run into problems, when you want to make use of multiple cores to throw at the problem.
In FP I think composability and interchangeability of these higher order functions help, in case the supplied filter did not work like you want: You could write your own filter function variant that guarantees to return the same list if the predicate function returned true for each element.
You can pretty efficiently do this in Clojure

    (update-in
     {:foo [{:bar ["a"], :baz ["b" "c" "d"]}]}
     [:foo 0 :baz 1]
     clojure.string/capitalize)
    
    
    ;; => {:foo [{:bar ["a"], :baz ["b" "C" "d"]}]}
This only works if you know what to edit up-front.

If you're processing a tree-like structure via recursion (e.g. type-checking an AST implemented with persistent immutable structures in a compiler) it's no longer easy.

> how do you know if the inner list changed or not

This is a vital question that virtually every programming language overlooks.

If you are using a mainstream language, you do not know if they were modified.

If you use a language which has a type system to segregate mutable operations from immutable operations, then you know if they were modified.

> They work in the small, but don't truly scale.

Go even bigger then. Work with the kind of data that won't fit in memory. Put your data on disk - you can't insert into the middle of a 1GB+ file.

I can't stand the standard update-based interface of persistent data structures either. https://lib.rs/crates/im wraps persistent structures in a standard vector interface, with instant clones (only copying a pointer to refcounted data) but slower assignments (which may copy data internally if other references exist), which is effectively equivalent to CoW. A persistent update is instead written as a clone followed by in-place assignment, and you can mutate in-place (which copies if there are other references to the data being modified). The nice part is that only IndexMut detaches the underlying data from other references.

I haven't used this crate yet, but if I have a need for persistent data structures I'd definitely try this since it's more accessible.

I think a lot of the pain in Scala/Rust etc immutable data structures comes down to static typing. In Clojure for instance you can very ergonomically do deeply nested persistent updates.

    (update-in
     {:foo [{:bar ["a"] :baz ["b" "c" "d"]}]}
     [:foo 0 :baz 1]
     clojure.string/capitalize)

    => {:foo [{:bar ["a"], :baz ["b" "C" "d"]}]}
In Haskell you can you Lenses to easily manipulate sub-trees, individual values, or a selection of values in tree like data structures. It is very powerful.
In Clojure the default associative data structure is a persistent map and cloning it with 0 changed nodes is basically a no-op.