Hacker News new | ask | show | jobs
by viciousplant 1620 days ago
That "Unreliable gradients" part is what I worry about most, why would I use a product when it just "returning incorrect gradients"?
1 comments

I'd be interested to know whether that occurred pre- or post- Zygote.jl starting to use ChainRules.jl internally. Zygote's own rules used to be tested with a half-baked finite differences scheme, and it was easy to miss bugs. All rules in ChainRules are tested using the testing helpers in ChainRulesTestUtils.jl, which are very thorough.

Then, some bugs are not actually bugs. For example, the gradient of some function wrt to elements of a symmetric matrix is and should be different depending on whether the matrix is structurally symmetric or not. Or, when a function is not differentiable in the calculus sense, different ADs often have different conventions for what they will return. These cases come up all the time.

> For example, the gradient of some function wrt to elements of a symmetric matrix is and should be different depending on whether the matrix is structurally symmetric

Ugh. Yes and no, but mostly no.

A gradient is a collection of partial derivatives. The normal meaning of partial derivative with respect to symmetric matrix elements does not actually exist, because you can't vary them independently. This is actually a generic problem for any overparameterized constrained system. People still want to be able to use the standard tools of calculus though. You can reasonably do things like use Lagrange multipliers to do constrained optimizations. Or you can express a symmetric matrix in terms of only the n(n+1)/2 variables, treat them as a vector and get the gradient over that. (This is often what is done without actually describing the underlying method. This is the so-called "symmetric gradient"). But it's generally not what you actually want, even though it can lead to the same stationary points.

The best you can do is the unconstrained gradient restricted to the manifold of symmetric matrices (or whatever constrained object you are interested in). But even this is a (subtly) different thing than the gradient in the ambient space, even though it works in as large a set of cases as possible. Happily for symmetric differentials applied to symmetric matrices, the naive formulas of symmetrizing the gradient just work.

See, e.g. https://arxiv.org/abs/1911.06491 for some tracking down of the history and cogent analysis of what went wrong and https://saturdaygenfo.github.io/posts/symmetric-gradients/ for why it matters in practice (e.g. wrong and changing direction of descent for gradient descent, though still "downwards").

My point is I think simpler than what you're talking about, so let's replace "symmetric" with "diagonal." Suppose I have a structurally diagonal matrix, represented by a type `Diagonal` that can only represent diagonal matrices, and then a non-structurally diagonal matrix, represented by a normal dense matrix filled with zeros except on the diagonal.

I pass this as input to some function that uses more than just the diagonal and compute the gradient (here a matrix). In general, for the dense matrix, the resulting gradient will not be diagonal. Why should it be? If your function doesn't care if the matrix is diagonal, neither does the AD.

But what about for `Diagonal`? Well, the AD could handle this different ways. First, it could interpret all matrices as embedded in the dense, unconstrained matrices, in which case it would return a dense non-diagonal matrix. Or it could ignore that `Diagonal` represents a matrix entirely and instead return some nested structure that contains gradients of its stored fields, here the gradient of a stored diagonal vector. Similarly, it could represent this nested structure as just another `Diagonal`.

There are various reasons why an AD might do any one of these; neither is right or wrong. But note that the only case that gives you the same answer as the dense one is the one that promotes to a dense input, which is basically discarding all structure to begin with.

Most ADs don't ever have to work with matrix polymorphism, so their user bases don't need to think about this, but Julia has _many_ polymorphic matrix types, so Julia ADs have to take this seriously, and if new users naively pass a polymorphic matrix type as input to some `gradient` function, they might be perplexed at what they get back and think the AD engine has a bug. See the ChainRules talk at EuroAD 2021 for more details: https://youtu.be/B3bC49OmTdk?t=761

Ah, yes. That is a different point than the one I thought you were making.

Both the symmetric and diagonal cases are lucky ones where the constrained derivative is the same shape and size as the original (as the constrained manifolds are flat). Keeping it as a Diagonal in code then just works.

This is in marked contrast to "unit vectors", "orthogonal matrices", or similar non-flat manifolds. The derivatives in these cases although locally are the same dimension as the manifold, usually must be represented as something that spans the entire containing space once you deal with derivatives at multiple points.

Yes, precisely. It gets complicated very quickly. Cotangents for structural matrices in reverse-mode AD tend to wander away from the cotangent space at the corresponding point into the tangent space of the embedding for various reasons. Any time they wander away, there's the potential for huge performance losses. We use projections to routinely project them back to the correct cotangent space, but this at best a band-aid, and we could certainly use a better solution.

For the interested, this arises from the combination of implicitly embedding in the forward pass (e.g. multiplying an orthogonal matrix by a vector is implicitly the same as making the matrix dense first and then multiplying), where you would then need some sort of projection in the reverse pass. If your AD operates on the scalar level, you get this for free. But such ADs don't make full use of optimized BLAS routines. You can get that by writing custom AD rules, but then in the reverse pass you usually end up in some embedded cotangent space and need a projection to set things right. There are definitely trade-offs.

Thanks for the references. We're trying to write up a paper about all this work, and (to us) it seems obvious that differential geometry is the right framework.

What reverse mode AD talks informally about as "gradients" are cotangents, and if you are constrained to a submanifold of another manifold (such as the space of all symmetric matrices, embedded in the space of all NxN matrices) then your cotangent is a projection of that in the larger space. This reduces to the obvious thing for symmetric matrices, as you say, but there are less obvious AD-relevant cases (like the space of orthogonal matrices, or the space of "ranges", vectors with uniformly spaced elements) where getting it right with your bare hands is tricky.

I don't see any of the relevant terminology in these links, but they do seem to be thinking about related problems. Perhaps they have re-invented the wheel? Amused to read that standard cookbooks seem to have faithfully reproduced the recipe for a square wheel, for decades.