
The dependency tracking and the graph/SQL thing are largely independent pieces.

The graph/SQL thing is used to query the memoization tables of the memoized functions by "joining them along a given computational graph". This roughly means that you can point to a computation, and ask: give me a table of the values of such-and-such variables in this computation across all analogous computations in the storage. Here, "analogous" could mean, for example, running the same composition of functions, but with different input parameters. The motivation is to be able to easily ask a broad class of natural queries of your storage (e.g., how given outputs depend on given inputs across all experiments of a given kind).

For example, say you have two memoized functions, `increment` and `add`, and run this:

    with storage.run():
        for i in [1, 2, 3]:
            x = increment(i)
            y = add(i, x)

Running this code will memoize a bunch of calls to `increment` and `add` in a way that recognizes that the output of the call to `increment` is the second input to the call to `add` (behind the scenes, they point to the same saved object). Then, if you call `storage.similar(x, y)`, you'll see a message like this:

    Pattern-matching to the following computational graph (all constraints apply):
        a0 = Q() # input to computation; can match anything
        a1 = Q() # input to computation; can match anything
        x = increment(a=a0)
        y = add(a=a1, b=x)
        result = storage.df(x, y)

and a table like this:

    x  y
    2  3
    3  5
    4  7

which tells you all the values `x` and `y` have taken in executions of this kind of program. Since you've only run this with `i = 1, 2, 3`, you get back just the results from this computation. But if you've been running similar things for weeks, you may not remember all the settings you've run this with so far, and this query will reveal them.

This works by looking for values `a0, a1, x, y` in the storage that satisfy both constraints, `x = increment(a=a0)` and `y = add(a=a1, b=x)`. Finding such values amounts to joining the memoization tables of `increment` and `add` on the shared variable `x`: the output column of `increment` must match the `b` column of `add`. Since this join is implicit in the computation itself, you can just point to the variables and get the result.
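To make the join concrete, here is a rough standalone sketch using Python's built-in `sqlite3`. It is not how the library stores things internally: the table names `increment_calls` and `add_calls`, their columns, and the manual `INSERT`s are all made up for illustration, and plain values stand in for the references to saved objects that the real storage would join on.

```python
import sqlite3

# Hypothetical memoization tables: one row per saved call.
# (Illustrative schema only; not the library's actual storage layout.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE increment_calls (a INTEGER, output INTEGER)")
conn.execute("CREATE TABLE add_calls (a INTEGER, b INTEGER, output INTEGER)")


def increment(a):
    return a + 1


def add(a, b):
    return a + b


# Record each call by hand, mimicking what memoization would save.
for i in [1, 2, 3]:
    x = increment(i)
    conn.execute("INSERT INTO increment_calls VALUES (?, ?)", (i, x))
    y = add(i, x)
    conn.execute("INSERT INTO add_calls VALUES (?, ?, ?)", (i, x, y))

# The constraints x = increment(a=a0), y = add(a=a1, b=x) become a join:
# the output of increment must equal the b input of add.
# a0 and a1 are left unconstrained, so they can match anything.
rows = conn.execute(
    """
    SELECT inc.output AS x, ad.output AS y
    FROM increment_calls AS inc
    JOIN add_calls AS ad ON ad.b = inc.output
    ORDER BY x
    """
).fetchall()

print(rows)  # [(2, 3), (3, 5), (4, 7)]
```

The point of `storage.similar(x, y)` is that you never write this join yourself; it is derived from how `x` and `y` were computed.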

Hope that helps! I realize that this is a fairly non-standard way to query, but I believe it's well worth the benefits in terms of reducing the boilerplate required to interact with computational data. Let me know if you think of ways to improve the presentation or have any other questions!