Hacker News new | ask | show | jobs
by hinkley 334 days ago
I found this video very helpful:

https://m.youtube.com/watch?v=oBt53YbR9Kk

Yes, it is 5 hours long. At least 4 of those are worth it.

Many people confuse caching with DP. I’ve had that conversation too often. I think it’s down to the memoization examples people choose being too toy and just looking like caching. Caching is global shared state, whereas memoization can be scoped to the current calculation and still be useful.

But they always skip over tabulation, which is where I believe DP distinguishes itself.

3 comments

The key difference between caching and memoization is not global shared state, it's determinism. That's why the latter mostly applies locally. In that sense, they're only semantically different. Data in a cache can be expected to go stale; it must not if you rely on the structure as a memoized equivalent to calling the function with the given inputs. In practice, you can have global cache whose data is deterministic and you can also have local cache that goes stale. The latter is not memoization.
I think you’re quibbling. Sharing state between independent tasks leads to nondeterminism. That’s the whole problem with shared state. Spooky action at a distance creates nondeterminism and even undefined behavior.
> I think you’re quibbling.

I can assure you I'm not.

There are absolute claims to make, they just aren't the ones you're making. Determinism is an absolute claim of memoization. "Local rather than shared state" is simply not, no matter how many time you repeat it is.

Excluding shared state from memoization implies that you can't rely on the technique in parallel scenarios, which is just not true. I can think of more than a handful of DP problems where you can further optimize with parallelism if the function is modified to rely on shared state memoization. For example, the slew of problems that rely on the tabulation to simply fetch one value in the "previous row" and the value at the "previous column" of the current row. In practice, instead of iterating through rows and waiting to completely process them one at a time, you could simply start processing another row in parallel once you have just enough of the requisite "previous" data. This is absolutely valid DP. Some DP solutions even receive their tabulation. How does that not qualify as global?

If you thought "memoization is never shared", then in practice, not only your possibly very deterministic functions would end up solving the same problems over and over each time they're called, you would also deny yourself the opportunity of tremendous optimization techniques given by a guarantee of determinism.

Determinism is key. Not shared state.

> Many people confuse caching with DP. I’ve had that conversation too often.

This is the beginning of my thesis. Without this no other conversation is productive.

The problem with the DP space in specific and coding in general is that a substantial fraction of 'everyone' hears 'memoization' and immediately substitutes caching into their thought processes, "oh that's just caching, got it," and then believes they are productively contributing to the followup conversation when really all they are doing is confusing everyone else and themselves.

That's my beef, frustration, and frankly, institutional trauma. Equating them in your brain makes everything you say from them on practically useless.

You have a fair point about parallelism, but even here the sharing you illustrate is between parallel sub-problems, the 'sub' is IMO doing all of the work in that sentence. Versus me caching a header because you and I have the same account settings, only there's a bug because it's now showing you my unread message status due to incorrect sharing.

But dp is a form of caching, no? It sacrifices space by caching intermediate results to compute successive results. The only reason it is called dp is because the "inventor" (Iirc Bell) needed a cool name.
DP is not a form of caching. DP is a way of breaking down a problem into subproblems. Caching (memoization) is one way to use this subproblem structure to reduce the amount of computation that ends up being performed in the end.
DP is a resolution technique that uses progressive (i.e. "dynamic") memoization (i.e. "programming" as in "tabulation", or resulting data stored in and retrieved from a table) as a way to curtail the cost of overlapping (i.e. intermediary and repetitive) subproblems. Memoization is a form of deterministic caching.
No, memoization is not caching. That’s a reductive understanding that causes no end of trouble.

Caching is global shared state. It brings nearly every problem that global shared state brings with it. If you conflate the two you miss all of the advantages of DP and “just use caches”.

Memoization in the scope of DP (some languages misuse the word in their APIs) is acknowledging that a problem is self recursive and eliminating the duplicate work by either inverting the problem or storing previously calculated values. These are usually call graph scoped. Nobody else sees the value. For instance in the case of solving Fibonacci with iteration, you need only remember the previous two values, which you store in registers. There is no table. There is no O(n) storage space.

Again, tabulation is when the contrast becomes stark. You’re creating an array where the relationship between the values contains information that makes the calculations cheap.

No definition of "caching" I know imply global shared state. Local private caches are commonplace.
I don’t know what rock you’re living under but Redis is absolutely GSS, and any lookup table accessible between different tasks is shared state. In a language with threads anything the entire app can see is considered global shared state.
I think you misunderstand. DP is not equal to "caching", but in the set of all things that are "caching" DP is contained. I cannot think of any DP algorithm that does not explicitly or implicitly use a memory cache. Can you? Even in the case of DP Fibonacci the two registers of previous values constitute a cache.
I was given a programming interview once where the interviewer gave me a DP problem (which I didn't know at the time), and he failed me because I used a cache instead of memoization. Asides from being inelegant, what are the downsides to caching?
Memoization is caching so that's a strange reason to fail someone. You cache based on the function call and its arguments.

  SOME_FUN = dict(base_cases)
  def some_fun(a, b, c):
    if (a,b,c) in SOME_FUN: return SOME_FUN[(a,b,c)]
    result = some_fun(a - 1, b - 1, c) + some_fun(a, b, c-1) # or whatever the recursive calls would be
    SOME_FUN[(a,b,c)] = result
    return result
That's it. Some languages (Python, Lisp) make it easy to take the naive version and wrap it (Python decorators are useful here) so that you don't have to write that code above. That's what the @cache decorator in functools does for you. You can memoize the naive version without needing to actually write a specific memoized version.
No, this is what forty years of semantic dilution gets you.
https://en.wikipedia.org/wiki/Memoization

> In computing, memoization or memoisation is an optimization technique used primarily to speed up computer programs by storing the results of expensive function calls to pure functions and returning the cached result when the same inputs occur again. Memoization has also been used in other contexts (and for purposes other than speed gains), such as in simple mutually recursive descent parsing.[1] It is a type of caching, distinct from other forms of caching such as buffering and page replacement. In the context of some logic programming languages, memoization is also known as tabling.[2]

It includes citations so I won't bother repeating them. Memoization is caching, there is no confusion about this in the literature.

Because Wikipedia has never been wrong and doesn’t suffer from dilution.

The distinction I am making is that caching is not memoization.

> It is a type of caching, distinct from other forms of caching

Distinct. As in not to be confused with.

If I wrote "a square is a rectangle" would you also waste time shouting that rectangles aren't squares? I didn't write that all rectangles are squares. Nor did I write that all caching is memoization.

Even your quote doesn't support your quibbling. It says that memoization is distinct from other forms of caching. Which does not mean that it is not a form of caching.

EDIT: Removed a particularly rude bit.

All dogs are mammals, but not all mammals are dogs.
Global shared state. High memory overhead. Evictions mean that the beginning of your call can see potentially different values than the end, and most of us write code like this as if it is transactional. We cannot handle a user’s account being closed or upgraded to an admin in the middle of a transaction. So using a cache to keep refetching the user data instead of passing it with the call graph is fraught (and makes unit testing terrible).

As a concrete example, my coworker was trying to walk a DAG with many leaf nodes that were repeated (think dependency tree). He used a cache to do this. It had bugs (2 I knew, but 6 I found). It was one of three large calculations we made at startup, and it should have been the least of these. But it ate memory like candy and added a lot to our start time which was slowing down deployments.

So I replaced it with a DP solution that just looked like normal code, reducing startup memory by ~80 MB and start time by half. By walking the tree and making a list of all the nodes in dependency order and then processing them once, after the list was compiled. And for the microservices and dev tools where we ran this same code the startup time became negligible.

Part of the problem is that the library we used did not like being called recursively, so you had to dance around reentrancy bugs and that cost time and space. But a linearized run has no recursion.

Edit to add: I realize now that I’ve left the conclusion ambiguous.

His two main sins were one, that the cache lived beyond the useful lifetime, and two, that even with the cache some amount of post-processing was required. By unrolling the tree and running each calculation once I reduced the post processing by about 3/4 and climbing as the tree grew in width and height.

Memoization is deterministic caching, meaning that the structure is a perfect stand-in for a deterministic function. The "determinism" of the function itself is relative to some agreed upon lifespan, during which if you feed it some input, the corresponding output is guaranteed to always be the same (e.g. until the next reboot, the runtime of the main program, the span of a function call, etc). Within that timespan, the result of computations can be reliably stored and retrieved from the cache. When it's exceeded, the function's corresponding cache also goes stale and should be considered unreliable and to be discarded.

Note that in the case of memoization, the guarantee of determinism (and associated lifespan) is attached to the function, not the caching structure. Unlike some other caching approaches that rely on a replacement policy.