by smacke 1131 days ago
Personally I don't like the "write to disk" approach; I think it kind of just punts the state problem somewhere else (i.e. from memory to disk). Writing to a database and adding versioning is better, but that's a lot of machinery to expect a notebook user to adopt (though maybe better tooling could help). Also, a lot of Python objects are not picklable out of the box (e.g. generators), and pickle itself is a mess.

I definitely agree (and I think given what you work on you would be horrified by how I define cached functions that capture locals), but I think in practice getting to a state where you can restart your kernel often makes it easier to reason about state. But you’re definitely right, it would be better to reason correctly here.

One thing I’ve toyed with is writing a Jupyter kernel extension that notes what new locals you’ve defined in a cell, figures out what locals are read, and creates a (cached) function from the cell. E.g. a cell that has `y = a @ x + b` becomes

    @cache_to_disk
    def compute_y(a, x, b):
        return a @ x + b
    y = compute_y(a, x, b)
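To make the "figure out what's read and what's defined" step concrete, here is a rough sketch using the standard `ast` module. The function name `cell_inputs_and_outputs` is made up for illustration, and it only handles straight-line cells of plain assignments; augmented assignments, imports, function definitions, and comprehensions would need more care.

```python
import ast
import builtins


def cell_inputs_and_outputs(source):
    """Return (names the cell reads from outside, names the cell defines).

    Hypothetical helper: walks top-level statements in order, so a name
    read before the cell assigns it counts as an input.
    """
    defined = set()
    inputs = set()
    for stmt in ast.parse(source).body:
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        # Reads first: anything loaded that the cell has not defined yet
        # (and that is not a builtin) must come from the enclosing namespace.
        for n in names:
            if (
                isinstance(n.ctx, ast.Load)
                and n.id not in defined
                and not hasattr(builtins, n.id)
            ):
                inputs.add(n.id)
        # Then record what this statement assigns.
        for n in names:
            if isinstance(n.ctx, ast.Store):
                defined.add(n.id)
    return sorted(inputs), sorted(defined)


inputs, outputs = cell_inputs_and_outputs("y = a @ x + b")
```

For the example cell, `inputs` would become the generated function's parameters and `outputs` its return values.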
I don’t worry much about serialization - 90% of the time what I need to cache is dataframes (write to parquet), and the rest is trained models (custom serializer). People rarely need to cache generators, in my opinion.
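For reference, a minimal sketch of what a `cache_to_disk` decorator could look like. This is not my actual implementation: it assumes the arguments and the result are picklable, uses pickle as a stand-in for the parquet and custom serializers mentioned above, and the `CACHE_DIR` location is an arbitrary choice. It also does not invalidate the cache when the function body changes.

```python
import functools
import hashlib
import pickle
import tempfile
from pathlib import Path

# Assumed cache location, chosen only for this sketch.
CACHE_DIR = Path(tempfile.gettempdir()) / "notebook_cache"


def cache_to_disk(func):
    """Cache func's result on disk, keyed by its name and pickled arguments."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Build a key from the function name and arguments; this relies on
        # the arguments pickling to the same bytes on every call.
        key_bytes = pickle.dumps((func.__qualname__, args, sorted(kwargs.items())))
        path = CACHE_DIR / (hashlib.sha256(key_bytes).hexdigest() + ".pkl")
        if path.exists():
            # Cache hit: load the stored result instead of recomputing.
            return pickle.loads(path.read_bytes())
        result = func(*args, **kwargs)
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        path.write_bytes(pickle.dumps(result))
        return result

    return wrapper
```

Because the results live on disk, they survive a kernel restart, which is the whole point; swapping the two `pickle` calls on the result for a type-specific writer and reader is where the dataframe and model cases would plug in.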