Hacker News new | ask | show | jobs
by triska 1268 days ago
Very nice!

This solution uses the library predicate list_to_set/2, relating a (known) list Ls0 of elements to a list Ls without duplicates, where the elements occur in the same order in which they first appear in Ls0. I think it is interesting to consider how such a relation can be described in Prolog, and also how efficient it can be.

An immediate solution suggests itself, considering the elements of Ls0 in the order they appear, and keeping track of the elements that have already been "seen". If an element is encountered that has already been seen, ignore it, otherwise it is part of the list Ls we want to describe. We can use a list to keep track of elements that have already been encountered:

    list_to_set(Ls0, Ls) :-
            phrase(firsts(Ls0, []), Ls).

    firsts([], _) --> [].
    firsts([L|Ls], Seen) -->
            (   { member(L, Seen) } ->
                []
            ;   [L]
            ),
            firsts(Ls, [L|Seen]).
This works correctly if the list is ground:

    ?- list_to_set("Corvus corax", Ls).
       Ls = "Corvus cax".
Yet, this solution has a very severe drawback: It is worst-case quadratic in the number of elements, and thus not usable for long lists:

    ?- length(_, E),
       E #> 10,
       N #= 2^E,
       numlist(1, N, Ls0),
       time(list_to_set(Ls0, Ls)).
yielding:

       % CPU time: 0.222s
       E = 11, N = 2048, Ls0 = [1,2,3,4,5,...], Ls = [1,2,3,4,5,...]
    ;  % CPU time: 0.880s
       E = 12, N = 4096, Ls0 = [1,2,3,4,5,...], Ls = [1,2,3,4,5,...]
    ;  % CPU time: 3.518s
       E = 13, N = 8192, Ls0 = [1,2,3,4,5,...], Ls = [1,2,3,4,5,...]
    ;  ... .
So, how to improve it? Well, it may be tempting to use for example a hash or an AVL tree to keep track of the "seen" elements, so that it can be more efficiently decided whether an element has already been encountered. And indeed, that is easy to do, and reduces the runtime considerably.

For example, using the commonly available library(assoc) for AVL trees, providing O(log(N)) lookup:

    list_to_set(Ls0, Ls) :-
            empty_assoc(A0),
            phrase(firsts(Ls0, A0), Ls).

    firsts([], _) --> [].
    firsts([L|Ls], A0) -->
            (   { get_assoc(L, A0, _) } ->
                []
            ;   [L]
            ),
            { put_assoc(L, A0, t, A) },
            firsts(Ls, A).
With this simple change, we get for the query above:

       % CPU time: 0.034s
       E = 11, N = 2048, Ls0 = [1,2,3,4,5,...], Ls = [1,2,3,4,5,...]
    ;  % CPU time: 0.070s
       E = 12, N = 4096, Ls0 = [1,2,3,4,5,...], Ls = [1,2,3,4,5,...]
    ;  % CPU time: 0.155s
       E = 13, N = 8192, Ls0 = [1,2,3,4,5,...], Ls = [1,2,3,4,5,...]
    ;  ... .
The most interesting part is that we can do significantly better, by leveraging Prolog's logic variables to propagate the information whether elements have already been encountered, yielding a very efficient solution where sorting the list Ls0 (or rather: the list of pairs LVs0, where we associate with each element of Ls0 a logic variable that can be used to propagate information by unifying it with other variables and more concrete terms) dominates the asymptotic complexity:

    list_to_set(Ls0, Ls) :-
            maplist(with_var, Ls0, LVs0),
            keysort(LVs0, LVs),
            same_elements(LVs),
            pick_firsts(LVs0, Ls).

    pick_firsts([], []).
    pick_firsts([E-V|EVs], Fs0) :-
            (   V == visited ->
                Fs0 = Fs
            ;   V = visited,
                Fs0 = [E|Fs]
            ),
            pick_firsts(EVs, Fs).

    with_var(E, E-_).

    same_elements([]).
    same_elements([EV|EVs]) :-
            foldl(unify_same, EVs, EV, _).

    unify_same(E-V, Prev-Var, E-V) :-
            (   Prev == E ->
                Var = V
            ;   true
            ).
We now get significantly improved performance:

       % CPU time: 0.003s
       E = 11, N = 2048, Ls0 = [1,2,3,4,5,...], Ls = [1,2,3,4,5,...]
    ;  % CPU time: 0.006s
       E = 12, N = 4096, Ls0 = [1,2,3,4,5,...], Ls = [1,2,3,4,5,...]
    ;  % CPU time: 0.013s
       E = 13, N = 8192, Ls0 = [1,2,3,4,5,...], Ls = [1,2,3,4,5,...]
    ;  ... .
And this is indeed how list_to_set/2 is implemented for example in Scryer Prolog's library(lists):

https://github.com/mthom/scryer-prolog/blob/fd19128530f68c46...

1 comments

I wanted to swing by with two notes.

First, for those who don’t recognize the username, this post was from Markus Triska, whose homepage (metalevel.at) is an absolute wealth of knowledge on Prolog. I’ve learned so much from it.

Second, for Markus: thank you :-)

I also would like to echo the thank you to Markus. His Power of Prolog videos have taught me so much about Prolog. His recent one on DCGs is incredibly eye-opening.
Echoing the echo :)

Thanks Markus.