by sitkack 711 days ago
Does it survive restarts? You mention that they are exported as dataframes; can they be reimported? Does this mean we can run mandala on many machines and merge the dataframes together to get collective memoization?

Do you support persisting into external stores?

You mention IncPy in the README; have you discussed this project with Philip Guo? https://pg.ucsd.edu/

What is the memory and cpu overhead?

How does the framework handle dependencies on external libraries or system-level changes that might affect reproducibility?

How do you roll back state when it has memoized a broken computation? How does one decide which memoizations to invalidate vs keep?

1 comment

In order,

1. Yes, you can create a persistent storage by passing `db_path` to `Storage()`. The current implementation is just an SQLite file. To run on many machines, you don't really need to re-import from a dataframe: dumping to a dataframe is meant to be an exit point from `mandala`, so that you can do downstream analyses in a format more familiar than `ComputationFrame`. Instead, `ComputationFrame`s can be merged via the union (`|`) operator; see here https://amakelov.github.io/mandala/blog/01_cf/#tidy-tools-me... for an example. Storages don't support merging yet, but it's certainly possible!
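To make the "collective memoization" idea concrete, here is a minimal standard-library sketch of an SQLite-backed memoizer whose stores can be merged by key. It is not `mandala`'s implementation or API: the `SqliteMemo` class, its `calls` table, and the `merge_from` method are all hypothetical names, and hashing pickled arguments is a simplification that only works for simple, deterministically picklable inputs.

```python
import hashlib
import pickle
import sqlite3
from functools import wraps


class SqliteMemo:
    """Toy persistent memoizer (hypothetical; NOT mandala's implementation)."""

    def __init__(self, db_path=":memory:"):
        # Pass a file path instead of ":memory:" to survive restarts.
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS calls (key TEXT PRIMARY KEY, value BLOB)"
        )

    def memoize(self, func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # The key depends only on the function name and its arguments,
            # so the same call made on another machine yields the same key.
            payload = pickle.dumps((func.__qualname__, args, sorted(kwargs.items())))
            key = hashlib.sha256(payload).hexdigest()
            row = self.conn.execute(
                "SELECT value FROM calls WHERE key = ?", (key,)
            ).fetchone()
            if row is not None:
                return pickle.loads(row[0])
            result = func(*args, **kwargs)
            self.conn.execute(
                "INSERT INTO calls VALUES (?, ?)", (key, pickle.dumps(result))
            )
            self.conn.commit()
            return result

        return wrapper

    def merge_from(self, other):
        # Union of two stores: keep existing rows, add the ones we lack.
        rows = other.conn.execute("SELECT key, value FROM calls").fetchall()
        self.conn.executemany("INSERT OR IGNORE INTO calls VALUES (?, ?)", rows)
        self.conn.commit()

    def __len__(self):
        return self.conn.execute("SELECT COUNT(*) FROM calls").fetchone()[0]
```

After `a.merge_from(b)`, calls that were only ever computed through store `b` are served from store `a` without recomputation, which is the behavior the question asks about.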

2. Already answered in 1.

3. Nope, but I'd be happy to (though I feel like `mandala` took memoization in a substantially different direction). Are you in a position to make an introduction?

4. This project is currently not optimized for performance, though I've used it in projects spanning millions of memoized calls. The typical use case is to decorate functions that take a long time to compute, so the overhead of memoization amortizes. A very quick benchmark on my laptop shows ~6ms per call for in-memory storage, ~9ms for a persistent storage, with a simple arithmetic function that otherwise takes ~0 time.
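The amortization argument can be checked with a little arithmetic. The sketch below takes the ~9ms persistent-storage figure quoted above as the fixed per-call overhead; the function durations are made-up illustrative values, and `overhead_fraction` is a hypothetical helper, not part of `mandala`.

```python
def overhead_fraction(overhead_s, compute_s):
    """Share of a first (uncached) call's wall time spent on memoization bookkeeping."""
    return overhead_s / (overhead_s + compute_s)


# 9 ms of overhead against functions of increasing cost (illustrative durations).
for compute_s in (0.001, 1.0, 60.0):
    frac = overhead_fraction(0.009, compute_s)
    print(f"{compute_s:>7.3f}s function: {frac:.2%} of the call is overhead")
```

For a function that takes about a second, the bookkeeping is under 1% of the call; for a near-instant function it dominates, which is why decorating trivial functions is not the intended use.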

5. Great question. Currently, the dependency tracer is restricted to user-chosen functions, to avoid tracking the function calls an imported library makes. You could use a bit of magic (import-time automatic decoration) to track all functions in a file or a directory (not implemented right now). The reasoning is that, for a typical multi-month ML project, you usually have a single conda environment, so you want to ignore library changes. Similarly, system-level changes (e.g. to environment variables) are also not tracked. I think a very useful feature would be to at least record the versions of each imported library, so that storages can be ported between environments with some guarantees (or warnings).
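The version-recording idea could look roughly like the sketch below, which uses only the standard library's `importlib.metadata`. This is not something `mandala` does today; `snapshot_versions` and `diff_versions` are hypothetical helpers, and where the snapshot would be stored alongside a storage is left open.

```python
from importlib import metadata


def snapshot_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_versions(recorded, current):
    """Return {package: (recorded, current)} for every version mismatch."""
    return {
        name: (old, current.get(name))
        for name, old in recorded.items()
        if current.get(name) != old
    }
```

A storage opened in a new environment could compare its recorded snapshot against `snapshot_versions(...)` and emit a warning for each entry returned by `diff_versions`, rather than silently reusing results computed under different library versions.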

6. - If an `@op` call was memoized, the underlying Python function call succeeded, so in this sense it can't be "broken"; it's possible, however, that there was a bug. In this case, you can delete the affected calls and all values that depend on them (if you keep these values, you're left with "zombie" values that don't have a proper computational history). The `ComputationFrame` supports declarative deletion: you build a `ComputationFrame` that captures the calls you want to delete, and call `.delete_calls()`, though there's still no example of this in the tutorial :) Alternatively, you can change the affected function and mark this as a new version. Then you should be able to delete all calls using the previous version (though this is not supported at the moment).

- How the cache is invalidated is detailed here: https://github.com/amakelov/mandala?tab=readme-ov-file#how-i...
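The "delete the affected calls and everything that depends on them" rule from the first part of answer 6 amounts to a graph traversal. Here is a minimal sketch under assumed data structures: `consumers` is a hypothetical mapping from a call id to the ids of the calls that used its output, and `calls_to_delete` is an illustrative helper, not `mandala`'s `.delete_calls()`.

```python
from collections import deque


def calls_to_delete(buggy_calls, consumers):
    """Collect the buggy calls plus every call that transitively used their outputs.

    consumers maps a call id to the ids of calls that consumed its output.
    Deleting the whole returned set leaves no "zombie" values behind.
    """
    doomed = set(buggy_calls)
    queue = deque(buggy_calls)
    while queue:
        call = queue.popleft()
        for downstream in consumers.get(call, ()):
            if downstream not in doomed:
                doomed.add(downstream)
                queue.append(downstream)
    return doomed
```

For example, if a buggy `load` call fed `train`, which in turn fed `eval`, all three are returned, while calls on unrelated branches are left alone.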