| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by singhrac 703 days ago

This is really nice. I'll take a much closer look when I get time later. I'm very interested in how you think about incremental computation - I find cache invalidation to be incredibly annoying and time-consuming when running experiments.

We wrote our own version of this (I think many or all quant firms do, I know such a thing existed at $prev_job) but we use type annotation inspection to make things efficient (I had ~1-2 days to write it, so had to keep the design simple as well). It's a contract: if you write the type annotation, we can store it. Socially this incentivizes all the complex code to becoming typed. We generally work with timeseries data, which makes some things easier and some things harder, and the output format is largely storable in parquet format (we handle objects, but inefficiently).

One interesting subproblem that is relevant to us is the idea of "computable now", which adds a kind of third variant from the usual None/Some (i.e. is something intrinsically missing, or is it just not knowable yet?). For example, if you call total_airline_flights(20240901), that should (a) return something like an NA, and (b) not cache that value, since we will presumably be able to compute it on 20240902. But if total_airline_flights(20240101) returns NA, we might want to cache that, so we don't pay the cost again.

We sidestep this problem in our own implementation (time constraints) but I think the version at $prevjob had a better solution to this to avoid wasting compute.

(side note: hey Alex! I took 125 under you at Harvard; very neat that you're working on this now)

1 comments

amakelov 701 days ago

Thanks Rachit! Great running into you after all these years!

Being aware of types is certainly a must in a more performance-critical implementation; this project is not at this stage though, opting for simplicity and genericity instead. I've found this best for maintenance until the core stabilizes; plus, it's not a major pain point in my ML projects yet.

Regarding incremental computation: the main idea is simple - if you call a function on some inputs, and this function was called on the same (modulo hashing) inputs in the past and used some set of dependencies that currently have the same code as before (or compatible code, if manually marked by the user), the past call's outputs are reused (key question here: will there be at most one such past call? yes, if functions invoke their dependencies deterministically).

You can probably add some tools to automatically delete old versions and everything that depends on them, but this is definitely a decision the user must make (e.g., you might want to be able to time-travel back to look at old results). I'm happy to answer more nuanced questions about the incremental computation implementation in `mandala` if you have any!