Hacker News new | ask | show | jobs
In Lisp, should lists be replaced with trees? (hxa.name)
47 points by bawatski 5515 days ago
15 comments

A list structure is already a 'dictionary tree', except it has lousy performance in access. (only really visible on large data sets).

* (second (assoc 5 '((1 2) (3 4) (5 6))))

> 6

The lousy performance doesn't matter, most of the time, because you have other data-structures for data, and your compiler generally is only interested in sequential access (constructing a parse tree out of it). The lisp code already being a tree makes parsing fairly simple...

So, I guess my point is that I don't see the point of this?

The only reason to really upgrade to a dictionary would be if you wanted to have non-sequential access to the contents of of your large-scale code-data-structure... generally the lists aren't large enough to merit it. You can do all of the dictionary operations on a list, and they are 'fast enough'. ----

I feel like the rationale here is created by conflating the idea of a cons and a list. I think generally, a list already accomplishes what the author sets out to accomplish.

They are a single step higher level than conses, and give you a hierarchical structure with the possibility for named associations. I don't really see how using something more complicated and memory intensive to build this basic data structure qualifies as an improvement...

> The lousy performance doesn't matter, most of the time, because you have other data-structures for data, and your compiler generally is only interested in sequential access ...

Exactly. Much of the time you don't care. The rest of the time, you can still use assoc-lists, except you create a branching structure on the keys such that no key in the list is a prefix of any other. The first entry of a node contains the value "at" that node, and the tail of the node is a list of branches, where each branch is a pair consisting of a string key and a child node.

The "get" and "put" functions in this structure are very simple recursive functions, and you can maintain very large dictionaries ("maps") this way. The get and put operations are essentially O(1), constant time, if you ignore the sheer length of the keys (which are usually reasonably short anyway). I've done tests where I insert hundreds of thousands of pseudo-random keys and values, and read them back, all very quickly and reliably. The code splits the keys into branching structures automatically. It's also a purely functional structure, so you can keep old versions of the maps.

The value at each node can be anything you like, not just a string, though to detect the absence of a value you need to use a "Maybe" type. In Fexl you can define constructors:

  \absent  = (       \absent\present absent)
  \present = (\value \absent\present present value)
So then the default value of a newly created branch is always absent, but a "put" operation replaces it with (present value). When you delete a key, it automatically detects if the node is empty (i.e. its value is absent and it has no branches) and then eliminates the node from its parent.

You also detect singleton nodes, which have an absent value and only one branch. In that case you can concatenate the key of that branch with the parent key upstairs, eliminating the extra branch.

Exactly. Rose by any name smells just as good.
not directly lisp related, but the team working on Fortress has some interesting work on using "conc lists" (pairs, and trees as a combination of pairs) as a basic unit of work for parallel processing[1], the analysis quite naturally considers its relation with lisp ad cons lists in other languages.

[1] http://labs.oracle.com/projects/plrg/Publications/ICFPAugust...

Also, Guy Steele is one of the two inventor of Scheme. So it may not be directly lisp related, but it definitely is indirectly lisp related ;-).
I think that's directly related. The answer to the question "Why should we replace our lists?" is that you can avoid known failure cases with lists at a relatively small price in complexity and memory in the worst case, and sometimes get big gains. Singly-linked lists are not suitable for a concurrent world and functional programming really needs to stop putting lists on a pedestal and writing them into their very syntax. There's actually other problems they cause, too, but the concurrent problem is a real killer, because you can not just program your way around it, you need a fundamentally different data structure.
http://vimeo.com/6624203 Conc lists as video. Titled Organizing Functional Code for Parallel Execution; or, foldl and foldr Considered Slightly Harmful (Guy Steele)
A better reason to replace lists with trees might be for parallel execution. I enjoyed this talk by Guy Steele on replacing lists with trees in Fortress: http://vimeo.com/6624203
"Branchs/trees are a superset of pairs/lists"

No, they aren't. A tree is acyclic by definition. Cons cells allow circular structures.

To take a familiar example, the Web is not a tree. While you might argue that each page has unique children (outlinks), there is no unique parent -- a page can have an arbitrarily large number of incoming links, and the page can link back to those pages in turn, producing cyclic structures of immense complexity.

For some interesting discussion on the limits of tree structures with respect to urban planning see A City is Not a Tree by Christopher Alexander (who is also the inspiration for the Design Patterns movement in software engineering). http://www.rudi.net/node/317

All good, just a nitpick: You don't need a unique parent to have a tree. You don't even need directed edges. :)

The purest definition of a tree graph is "a graph with no cycles".

We use different definitions. My trees are acyclic connected graphs with one root.
The structures described are called trees in the post, but are actually just sets of named objects, which can be used to describe cyclic structures in the same way conses can.
And this is useful: you can represent the transitions of a state machine in this way, for instance.
The structure he calls a 'tree' allows cycles and is a generalization of a list. Whether he uses the correct word isn't really interesting: it doesn't detract anything from the post.
Will Thimbleby (http://will.thimbleby.net/misc-an-experimental-language/) also explored similar ideas.
It's easy to implement both lists and name-value associations in terms of pairs, so I'm not sure why they need to be built in.

In Fexl (http://fexl.com/code), the only built in concept is the function, and all data is structured out of functions. I won't go into all the details here, but concepts like pair, list, key-value assoc, branch, and everything else are readily expressed as pure functions. So-called "circular" structures are created with the Y combinator, so they really are not circular in memory, only in concept.

There's an interesting write up about the idea of "everything is a function" in Haskell at http://conal.net/blog/posts/everything-is-a-function-in-hask....

What's your rationale vbehind everything being a function?

The seed was planted back in 1989 when I read a paper by Jørgen Steensgaard-Madsen in Communications of the ACM titled "Typed Representation of Objects by Functions" (CACM, January 1989, Volume 11 Number 1). I found it brilliant and I never could shake the idea after that.

Even something as lowly as a bit (Boolean value true or false) can be represented as a function. In Fexl, the functions for T and F are expressed as:

  \T = (\T\F T)
  \F = (\T\F F)
In short, the T function returns its first argument, and the F function returns its second argument. So you can do things like this:

  eq x 4 (print "Yes, it's 4") (print "No, it's not 4")
You don't even need an "if" function. Or, if you like the way "if" looks, you can define it as the identity function:

  \if = (\x x)
Then you can say:

   if (eq x 4) (print "Yes, it's 4") (print "No, it's not 4")
You can also define "lists" as functions. There are two cases, "null" and "cons":

  \null = (           \null\cons null)
  \cons = (\head\tail \null\cons cons head tail)
Now if you have a list called "groceries", you can do this:

  groceries
    (print "You don't need anything at the moment")
    \head\tail
      print "You need "; print head;
      print "and possibly some other things."
If you aren't comfortable "calling" a list object as a function, you can define the more familiar "observer" functions:

  \empty = (\list list T \_\_ F)
  \head = (\list list undef \head\_ head)
  \tail = (\list list undef \_\tail tail)
Then you could say:

  if (empty groceries)
    (print "You don't need anything at the moment")
    (print "You need "; print (head groceries);
      print " and possibly some other things.")
If you want an ordered pair, here you go:

  \pair = (\x\y \pair pair x y)
Now you can say:

  \the_pair = (pair 2.6 3.8)
  ...

  the_pair \x\y
    print "left is ";  print x;
    print "right is "; print y;
So, to sum it all up, I am just irresistibly attracted to the utter profundity of this concept. I like how the components of a piece of data simply "present themselves" to the handler function(s) applied to it. Also, you can name your data constructors anything you like -- there are no ultimately predefined names for anything, except any primitive combinators built into the C code, but even these can be shadowed or hidden. So if you want to use "item" and "end" instead of "cons" and "null", you can. (I do.) And if you want to use "first" and "rest" instead of "head" and "tail", you can.

If you like, you can download the code at http://fexl.com/code . One of these days I should put a sandboxed demo interpreter up on the site.

Although I keep hearing “everything is a function” (and 3 is a nullary or constant function), I don’t hear people say “everything is a list”, and 3 is really the singleton list [3].

Well, in musings around Arc, pg did mention defining everything as a list, including 3 as [[] [] []], I believe. So, it's not that no one says it. :)

Although Fexl does have built-in functions for long int and double float values, it is possible to do without them.

There are a couple of ways to define integers as pure functions. One is unary notation, which is equivalent to the "3" example you cite above. So you'd have these two constructors:

  \zero = (   \zero\succ zero)
  \succ = (\N \zero\succ succ N)
And then the "add" would be:

  \add == (\x\y x y \n succ (add n y))
Or, using the ';' (right-pivot) to avoid nesting on the right (though sometimes that can be too clever by half):

  \add == (\x\y x y \n succ; add n y)
That works fine for some applications. But you can use binary notation as well, defining an integer as a list of bits, with the lowest-order bit first. Then the "add" function becomes:

  \add == (\x\y x y \bx\nx y x \by\ny
     \sum=(add nx ny)
     bx
        (by (cons F (inc sum)) (cons T sum))
        (cons by sum))
Where inc is:

  \inc == (\x x (cons T null) \b\n
     b (cons F (inc n)) (cons T n))
I think I transcribed that right, but I haven't tested it in a while. It's all surprisingly fast -- I was actually doing long division of 100 digit numbers years ago in reasonable time in this notation, including the conversion to decimal notation by repeated division by 10.

One of these days I need to link a big number library into Fexl, but it hasn't been a priority for me. (Actually I need to think seriously about dynamically linked libraries, with linkage control done in Fexl itself. This can be sandboxed for secure applications, of course.)

Oh and here's my old division function, which again I haven't tested in a while, so no warranties expressed or implied:

  #---------------------------------------------------------------------------
  # (nat:div x y) divides x by y.  It yields a pair <q,r>, where q is the
  # quotient and r is the remainder.
  #
  # The result satisfies the equation x = q*y + r,  0 <= r < y.
  #
  # NOTE:  If you divide by zero, the function yields the pair <0,0>.
  #---------------------------------------------------------------------------
  \nat:div==(\x\y\:
      x (: null null) \bx\nx
      y (: null null) \by\ny
      by
      (
          # divide by odd - recur on x only
          nat:div nx y \q\r
              \r=(bx nat:2x1 nat:2x r)
              \d=(nat:sub r y)
              int:ge0 d
                  (: (nat:2x1 q) (int:abs d))
                  (: (nat:2x q) r)
      )
      (
          # divide by even - recur on x and y
          nat:div nx ny \q\r
              : q (bx nat:2x1 nat:2x r)
      )
  )
Lisp is already made of trees. A Lisp program is a pre-order traversal of the abstract syntax tree of a yacc'ed program in another programming language. This fact is WHY Lisp is powerful. By eliminating yaccing you can create new syntax on the fly without having to recompile your compiler.
By "replacing" lists with trees, am I supposed to understand that working with trees will be natural and easy, and lists will feel like second-class citizens? I thought we were beyond thinking like that. Building languages around a favorite data structure was never the right thing to do; it was always tunnel vision and bad design. Different data structures have different features and different performance characteristics. There's no reason for a language designer to pick a favorite data structure and force anyone who has to use a different one to deal with warts and second-class library and language support.
He seems to have heard of Lua but that didn't stop him from raising the question.

Next article: "Should airplanes be fitted with big rotor on top to allow for vertical take off and hovering? .. like helicopters?"

not that jets need a big rotor on top to do vertical takeoff and hovering

http://en.wikipedia.org/wiki/Harrier_Jump_Jet :)

The rotor on top has been done. http://en.wikipedia.org/wiki/V-22_Osprey
I wouldn't say lists should be replaced with trees, but adding a rich set of tree-based data structures to Lisp improves it substantially, in my view. To see what I mean, have a look at my FSet functional collections package, which does exactly that:

  http://common-lisp.net/project/fset/
Unlike what the OP is suggesting, FSet doesn't expose the internal structure of its trees. Instead it wraps them in a rich set-theoretic interface.
You can just build your tree data structure and develop operators to work with them in Lisp. There's nothing that restricts you to using lists.

So what's the point of "replacing" them (even if such a thing were possible without "replacing" Lisp)?

Cons cells give you binary trees. It is possible to encode any arbitrary tree as a binary tree. However, this does not mean that the expression of algorithms over such left child right sibling encoded trees is identical to one over trees with some higher branching factor. So the cons/binary situation will at the least require some additional layers of abstraction to simulate various traversal patterns of the n way tree on the encoding binary tree.

There also physical/efficiency concerns. There's a reason databases use highly branching trees.

And you can create the abstractions necessary to work with such a data structure and add them straight to the regular grammar of the language.

Lisp also has all of the other low-level primitives we're used to when working closer to the machine. Structs, indexed arrays, hashes, etc. If efficiency is a concern, just pick an implementation that compiles to machine code. SBCL is pretty good for this depending on your target platform.

Replacing conses seems like a waste of time.

A few people have asked, what is the point? since trees and associative access can just be implemented with lists.

It is the same basic point as having 'let' or 'case' etc. (which can all just be implemented with 'lambda' and 'if' etc.) -- it is just a bit higher-level to program with.

With something that makes programming a little easier (maybe), probably the question ought really be, why shouldn't we have it?

I think the answer is we should have it, but it's just a library.
OK. Then I suppose the question is: why should it be library not built-in?

The main thing not addressed in the article is: what are the language implementation implications? What are the pros and cons there?

In my way of thinking, the distinction between "library" and "built-in" is very fluid. That is, if you run an executable which automatically loads a library for you, then voila it is now built-in. If nothing else, you can set up your own local environment that you favor. I realize this has portability implications, but keep in mind that not everyone will want "bignums" or "socket operations" or "openssl" available all the time in every program.

For me, the language implications are nil. I would not change a single thing in Fexl to support name-value "trees" versus lists. It's all programmable using the system as it is. The S, C, Y, I, L, R combinators are what they are, forever and ever, world without end. :) And I'm quite sure the same thing is true of Lisp (i.e. just use defun or maybe defmacro). You'll notice that even in the original article the author didn't propose any new syntax. It's all just more functions.

I mean, even in Perl for goodness sake all you have to do is this:

    # A map is a list of branches.  Each branch is pair(key,data).  The data is
    # pair(val,map).  No two keys in a map have any prefix in common.  A key cannot
    # be null.  If you put a null val at a key, it deletes that key from the map.

    sub map_put
        {
        my $map = shift;
        my $key = shift;
        my $val = shift;

        die if !defined $key || ref($key) ne "" || $key eq "";
        die if !defined $val;

        if (is_atom($map))
            {
            return $map if is_null($val);
            return pair(pair(atom($key),pair($val,null())), $map);
            }

        my $branch = left($map);
        my $old_key = name(left($branch));

        my $cmp = $key cmp $old_key;

        if ($cmp == 0)
            {
            my $sub_map = right(right($branch));

            if (is_null($val))
                {
                return right($map) if is_atom($sub_map);

                if (is_atom(right($sub_map)))
                    {
                    # Combine singleton map upstairs.

                    my $top_key = name(left($branch));
                    my $next_key = name(left(left($sub_map)));

                    my $new_branch = pair(atom($top_key.$next_key),
                        right(left($sub_map)));
                    return pair($new_branch,right($map));
                    }
                }

            my $new_branch = pair(left($branch), pair($val,$sub_map));
            return pair($new_branch,right($map));
            }

        if (substr($key,0,1) eq substr($old_key,0,1))
            {
            my $len_common = len_common_prefix($key,$old_key);

            my $key_suffix = substr($key,$len_common);
            my $old_key_suffix = substr($old_key,$len_common);
            my $common_prefix = substr($key,0,$len_common);

            if ($key_suffix eq "")
                {
                return $map if is_null($val);

                my $sub_map =
                    pair(pair(atom($old_key_suffix),right($branch)),null());
                my $new_branch = pair(atom($common_prefix),pair($val,$sub_map));
                return pair($new_branch,right($map));
                }

            if ($old_key_suffix eq "")
                {
                my $old_sub_map = right(right($branch));
                my $new_sub_map = map_put($old_sub_map,$key_suffix,$val);
                my $old_val_atom = left(right($branch));

                if (name($old_val_atom) eq "")
                    {
                    return right($map) if is_atom($new_sub_map);

                    if (is_atom(right($new_sub_map)))
                        {
                        # Combine singleton map upstairs.
                        # TODO hey wait!  This is the same code as above.  Unify this.

                        my $top_key = name(left($branch));
                        my $next_key = name(left(left($new_sub_map)));

                        my $new_branch = pair(atom($top_key.$next_key),
                            right(left($new_sub_map)));
                        return pair($new_branch,right($map));
                        }
                    }

                my $new_branch = pair(left($branch),
                    pair($old_val_atom,$new_sub_map));
                return pair($new_branch,right($map));
                }

            return $map if is_null($val);

            my $old_branch = pair(atom($old_key_suffix),right($branch));
            my $sub_map = map_put(pair($old_branch,null()),$key_suffix,$val);

            my $new_branch = pair(atom($common_prefix),pair(null(),$sub_map));
            return pair($new_branch,right($map));
            }

        if ($cmp < 0)
            {
            return $map if is_null($val);
            return pair(pair(atom($key),pair($val,null())), $map);
            }

        return pair($branch,map_put(right($map),$key,$val));
        }

    sub map_get
        {
        my $map = shift;
        my $key = shift;

        die if !defined $key || ref($key) ne "" || $key eq "";

        return null() if is_atom($map);

        my $branch = left($map);
        my $old_key = name(left($branch));

        my $cmp = $key cmp $old_key;

        return left(right($branch)) if $cmp == 0;

        if (substr($key,0,1) eq substr($old_key,0,1))
            {
            my $len_common = len_common_prefix($key,$old_key);
            my $key_suffix = substr($key,$len_common);
            my $old_key_suffix = substr($old_key,$len_common);

            return map_get(right(right($branch)),$key_suffix)
                if $old_key_suffix eq "";
            return null();
            }

        return null() if $cmp < 0;
        return map_get(right($map),$key);
        }
That would be a lot cleaner in Fexl or Lisp, but it's totally do-able in any language.
This is either a reinvention of ASTs or it's Graph Reduction.

http://en.wikipedia.org/wiki/Abstract_syntax_tree http://en.wikipedia.org/wiki/Graph_reduction

:)

You provided two things that the original post is a reinvention of. Which is it? Why not read the article enough to decide which it is a reinvention of, then present your thinking in your comment?
Lists are trees.
Wait, I thought trees are lists.
No, trees are multidimensional lists. Lists are one-dimensional lists, which are degenerate trees and a simple primitive for the more general (always a loaded term) structure.
Wouldn't that make it a new language called Trep?