| I don't usually respond to old comments, so I don't know if you'll read this, but I hope I can encourage you to think more broadly about what "differentiable programming" means. Different fields have a different perspective on the same set of tools because those tools have different pathological cases in different areas. Context really matters. > Well, that was my point above: You can't really lump quantum programming together with probabilistic programming, as they are paradigms on different "levels" This distinction is not useful. If I write in a functional or logic programming language, it gets translated into imperative commands for an underlying architecture that is some mix of dataflow, event driven, automata-based, concurrent, etc... that is then further built on top of some physical atoms where an engineer worried about quantum effects. If I write in a quantum programming language, it will probably go through the same process for at least another 5 years. You might argue that quantum is somehow more dictated by the underlying physical model the way that people argue that imperative programming is closer to the physical world than functional programming. But the "level" doesn't change the usefulness of viewing all of these as "paradigms" worthy of study and analysis on their own terms with their own tools. At the level of studying a programming language, the "level" is a useful thing to be aware of for implementations and motivation but usually not for a theory. > Even more so, is seems that even defining semantics for differential programming is barely in its starting stages This is also not a useful distinction. OOP was also, infamously, a point of contention between the academic and outside worlds because it was developed and incredibly prevalent without a rigorous theory abstracting it beyond procedural programming. It became a "paradigm" despite that because there was a set of (informal) tools for reasoning about it on its own terms [1]. Likewise, differentiable programming has largely developed to formalize what makes programs written for machine learning frameworks different from programs written in the imperative/object oriented/functional language they are built on. Autodiff has mostly developed in practical usage, so the use cases are front-running the theory. There's increasingly hardware tailored to the execution model and software developers attempting to program it. There are approaches to problems like discontinuities that people have found solutions for without a rigorous theory justifying their use. There's a structure to why and how people are writing code for these applications as well as an operational theory for how to reason about it, but there's very little compositional, equational theory for these choices. To most people in machine learning, "differentiable programming" is just autodiff with pretty syntax because the term sprung from attempting to put what they are already trying to accomplish with that implementation on more solid theoretical footing as a computable model of a more general logic. That, hopefully, lets us more efficiently explore what a better domain-specific theory might be and if there are better execution models or logical frameworks. Autodiff itself is increasingly used as an umbrella term for many other methods of differentiation with different edge cases, so this is already happening in practice. To reduce "differentiable programming" to just its implementation ignores that aspect. It would be equivalent to equating machine learning to matrices. Not unreasonable computationally and not a terrible place to start for a theory, but deeply unsatisfying as a mature domain-specific theory. The main paper I linked [2] is not about autodiff at all. It's an attempt to establish a connection between differentiation in an analysis sense to models of (not otherwise obviously differentiable) program evaluation. The (unrealized) promise is that the centuries of understanding we have for the calculus of infinitesimals can be applied to the less-mature study of lambda calculus and nondeterministic computation. Papers like [3] cite it because it addresses (discrete) structures that analysis is less interested in and potentially provides a way to connect computation, calculus, and whatever it is that we're doing with machine learning. > Are you sure about that? I skimmed [1] as I wasn't hadn't read it and it seems to describe a rather restricted set of functions ("types are interpreted as vector spaces and terms as functions defined by power series on these spaces"), as there are many differentiable functions that cannot be defined as power series. Because it's a PL theory paper, it's not concerned with whether all differentiable functions can be represented, but whether all computable functions can be differentiated. And PL theorists are generally more comfortable than most to accept that most functions cannot be computed and choose a more restrictive model that enables more reasoning power. The category [4] is really the better place to start since it lets us also consider models that aren't vector spaces and [2] is best thought of as a prototype that left many gaps in the theory (for example the requirement on coefficients for convergence is wrong, though I can't remember which paper by Lionel Vaux proved this). It can be thought of as computable in finite instances, but it's unsound even when typed due to the zero term and result of sums. The quote you cite from [3] is easily misunderstood without that context. As a practice-focused paper, it cares very much about computability. Conditionals and loops are possible in [2] since it allows church numerals and fixed point combinators but it introduces a nondeterministic sum which is exponential in the number of evaluation steps and may diverge (doubly so since it's the untyped lambda calculus...) and is difficult to operationalize. That's what I meant by "wildly uncomputable". So, to them, [2] offers a useful mental framework for higher order features, but is not practical. The theory isn't there yet. The connections between logic, quantum, probabilistic, and differentiable programming can be understood by how the model treats the exponential modality (!) which converts the otherwise linear term to an analytic one. Differentiation decomposes this to give a sum of linear terms. Differential lambda calculus doesn't put any (more) structure on the sum. Probabilistic programming gives the added structure of a probabilistic sum where coefficients are weights. Quantum programming can be modeled via a Fock space [5] for (!) ([5] predates [2] so is not directly discussed as a differential category here). However, it's unclear what the right model for differentiable programming should be if we want something practical for the resulting derivative (much less antiderivative). Daniel Murfet et al [6] have some related work more directly in the context of machine learning. [1] https://www.cs.cmu.edu/~aldrich/papers/objects-essay.pdf
[2] https://www.sciencedirect.com/science/article/pii/S030439750...
[3] https://arxiv.org/abs/1911.04523
[4] https://ncatlab.org/nlab/show/differential%20category
[5] https://www.researchgate.net/publication/2351750_Fock_Space_... (sadly a researchgate link)
[6] http://therisingsea.org |
I did read it ;) Because I'm very much interested in this entire topic.
> I hope I can encourage you to think more broadly about what "differentiable programming" means
I'm trying to, but I find it hard. My stance was that differentiable programming seemes like this theory for which only a single example (namely autodiff) existed, as you also said ("autodiff has mostly developed in practical usage, so the use cases are front-running the theory"). But this entire comment of yours really clarified some things for me.
> The main paper I linked [2] is not about autodiff at all. >The quote you cite from [3] is easily misunderstood without that context
You finally convinced me to have a detailed look at this. Thank you for providing the context.
> Conditionals and loops are possible in [2] since it allows church numerals and fixed point combinators but it introduces a nondeterministic sum [...] and is difficult to operationalize. That's what I meant by "wildly uncomputable".
I think I may have misunderstood some of your previous comments (and perhaps vice versa) as it now dawns on me that you use a vocabulary than comes (I guess?) from PL theory and is very different from the one I'm used to, as a mathematician versed in analysis. I'll re-read them.
> Daniel Murfet et al [6] have some related work more directly in the context of machine learning.
I'm actually aware of Daniel Murfet but haven't read his work from last years. Did you have a specific paper from him in mind?