Me too. While I recall Calculus class, I’ve been paid to operate at the algebra level for years. I think this walkthrough is a retread of differential equations, in particular the kind that attempt to approximate the result of the function as it approaches the limit of a point. I get lost at the bits involving dimensionality. I think it’s evident but it’s been years since I’ve read mathematics at this level.
It took me a moment to realize you were using "Algebra" in a pejorative sense to mean a level of mathematics beneath Calculus, rather than as "The study of composable operators on quantitative spaces." Algebra is a very deep subject, no less advanced than Analysis (The branch of mathematics that includes Calculus).
Not really. I can find or write one if you're very interested.
- Gradients (slopes) are incredibly powerful computationally. If you can compute a function, you can compute its slope with almost no overhead. Relating to physical objects, imagine walking at the top of an igloo (where it's pretty flat and you won't lose your balance) vs the sides of an igloo (where it's vertical, and you'll fall over or otherwise break something). For real-world functions. you can ask where you are (top, bottom, middle, ... of the igloo) _and also_ what the slope is (flat, steep, ...) with the second step happening nearly for free.
- The whole "technique" is that when the things you want aren't being found, walk in the direction of least badness. Gradients define that direction, and since they're nearly free computationally you can abuse them as much as you want.
- The rest of the "technique" is figuring out how to describe ordinary comp-sci data structures as differentiable entities. E.g., the simplest way to differentiate most discrete objects (like database insertions) is to insert them probabilistically and apply your derivatives/slopes to the likelihood of the insertion succeeding.
I wish all explanations made complicated concepts this manageable!
> the simplest way to differentiate most discrete objects (like database insertions) is to insert them probabilistically and apply your derivatives/slopes to the likelihood of the insertion succeeding.
I hope it wouldn't trouble you to go a bit further with this idea? Namely, the idea of differentiating an object or database insertion sounds like it shouldn't be structurally possible.
And why would there be a probability function for a db insertion succeeding? How could it fail - if the db was full or if there are too many requests?
Imagine the particular data structure you'd like to use for your database has if/else branching logic. A simple example thereof is some kind of a BinaryTree<Key, Value> type. Till your termination condition, if the key is bigger go one direction, else go the other direction as you walk down the tree.
The problem you'll find is that the derivative of the error with respect to the _key_ is always zero. Intuitively, nudging the key just a little bit won't change the query result. Imagine some kind of a step function which is 0 for negative inputs and 1 for positive inputs. The slope at every input (except for something interesting happening at 0 -- not a "slope" mathematically, but for your real-world problem you might care about what happens there) is exactly zero.
Phrased another way, the optimization problem you'd like to solve via derivatives and gradients is discrete, requiring combinatorial techniques instead of gradient descent to solve -- at least, that's true if choosing the right database key is an important part of the problem.
If you're able to somehow make the composite algorithm tolerant of probabilistic answers, by far the easiest tweak you can make is to replace if/else logic with a probabilistic if/else. When the key is a lot bigger, almost certainly walk right in the tree. When it's a little bigger, you have a nontrivial chance of walking left instead. When it's a little smaller, you have a bigger chance of walking left and a nontrivial chance of walking right.
Doing so allows you to represent the output as `p * f(right) + (1 - p) * f(left)`, which is differentiable in `p`. There's a bit of nuance in how you get the proper gradient if you only actually take one path (basically, some of the time you'll take the right branch, some of the time you'll take the left, and you want to re-weight the gradient whenever you take a given branch so that if you made that decision many times and averaged the results you'd converge to the formula above), but since you have something differentiable in `p` you're able to propagate that information back to the key you used to derive the probability.
The space where I most see people wanting differentiable databases is in creating giant ML thingamabobs. In that case, the high dimensionality of the value being stored can work in your favor. E.g., you could query the DB a few times, record the probabilities of each path, and use those to do a weighted average of the values you found. High-probability paths will dominate the result, and low-probability paths might still mix in a little interesting information.
That technique would likely not work if you wanted to, e.g., just drop in a differentiable database into your favorite web app and try to follow the gradients to find the inputs causing a certain bug. You'd need something fancier, probably not based on trees at all (at least not explicitly; you might use a tree internally to represent the data, but the conceptual computation wouldn't have those if/else step-function cliffs).
So there will be some key given to the database. And the key either gets results from the right branch or the left (at decision points, as opposed to leaves which would be the result itself). But if you hardcode the branching condition to be something like a greater than or less than, it's not very useful in finding the best result for the key.
But if you have a probabilistic setup, then the key has some chance of retrieving data from the right and from the left. Then we have
{expected goodness = p * f(right) + (1 - p) * f(left)}.
Where f is some objective function that tells us the 'goodness' of the resulting data the key got from going right or left. If you differentiate the expected goodness with respect to p you get f(right) - f(left). Which is how much the expected goodness will change if I increase the probability of going right.
Suppose the derivative f(right) - f(left) was positive, so going right was better than going left, then I can write some script to change p positively so that next time when I get that key the probability of going right is higher. That way we can optimize p for given keys.
Very interesting! I hope I got everything right; where I couldn't understand, I used gpt to help me out (when it told me f was a 'goodness' function, I had a breakthrough lol). The discrete way feels like clean, understandable logic while the probabilistic way with changing p values feels really messy, but I can see intuitively why the latter gives better results.
https://thenumb.at/Autodiff/