| HN Mirror

As throwawaymath mentioned, it's not too hard to find papers written by them, but I'll highlight some papers that I found particularly interesting.

Many optimization problems that arise in machine learning can be viewed as minimizing a particular Bregman distance subject to affine constraints (I'll call these "Bregman distance optimization problems").

In [1], the authors develop some quite general and widely applicable convex analysis type results for Bregman distances and use those results to give the technique of "Auxiliary Functions" which can be used to derive and prove the convergence of algorithms for particular Bregman distance optimization problems.

In [2], the authors show that finding the optimal parameters of two quite different approaches of binary classification, AdaBoost and Logistic Regression, can both be simultaneously viewed as the same Bregman distance optimization problem (with slightly different initial parameters). They then present several algorithms for solving this unified problem (and thus, for optimizing both AdaBoost and Logistic Regression), and prove the convergence of their algorithms with the method of Auxiliary functions developed in [1]. This paper was the first general proof of convergence for AdaBoost which was proposed by Yoav and Schapire (one of the authors of [2]) and earned them the Gödel prize in 2003.

If you don't mind me plugging in some expository work of mine, I wrote an essay ([3]) giving an introduction to Bregman distances and is pitched at a lower level than [1] and [2]. The end of the first chapter discusses in particular the general Bregman distance optimization problem, setting it up in the same framework as [1] and [2], and the final chapter presents the algorithm and proof of convergence originally given in [2] but hopefully with a slightly simplified presentation due to focusing only on the Logistic Regression case of the algorithm.

In case you want to play around with it, I have implemented the algorithm from [2], as well as a related algorithm from [4] which incorporates L1 (Ivanov) regularization into that algorithm, both being available at [5]. The middle chapters contain brief discussions on the relation of Bregman distances with Exponential Families, and on Generalized Linear Models, which are relevant to the overall purpose of the essay but not so closely related to the content of [1] and [2] and may safely be skipped.

If you generally enjoy the types of problems discussed in [1] and [2], you might also enjoy some of these papers: [6] [7] [8] [9].

--------------------------------------

[1] - "Duality and Auxiliary Functions for Bregman Distances". Stephen Della Pietra, Vincent Della Pietra, and John Laﬀerty, 2001.

[2] - "Logistic Regression, AdaBoost and Bregman Distances". Robert Schapire, Michael Collins, Yoram Singer, 2001.

[3] - "Optimisation of Bregman Divergences". Ragib Zaman, 2018. https://github.com/RagibZaman/mathematical-optimisation/blob...

[4] - "Bregman distance to L1 regularized logistic regression". T. Huang and M. Gupta, International Conference on Pattern Recognition, 2008.

[5] - Notebook containing implementations of Bregman divergence based algorithms for Logistic Regression from [2] and [4]. https://github.com/RagibZaman/mathematical-optimisation/tree...

[6] - "Legendre functions and the method of random Bregman projections." H. Bauschke and J. Borwein, Journal of Convex Analysis, 1997.

[7] - "Inducing features of random ﬁelds". Stephen Della Pietra, Vincent Della Pietra, and John Laﬀerty, 1997.

[8] - "Statistical learning algorithms based on Bregman distances". Stephen Della Pietra, Vincent Della Pietra, and John Laﬀerty, 1997.

[9] - " A maximum entropy approach to natural language processing". Adam L. Berger, Stephen Della Pietra and Vincent Della Pietra. Computational Linguistics, 1996.