Piggybacking on this, I instead recommend an introductory ML book like Bishop or Murphy, a statistical ML book like Mohri or Shai Shalev-Shwartz, and a textbook on nonlinear optimization, convex or otherwise.
The jump from classical machine learning to deep learning is not far if you have a good understanding of first principles.
I think that SSS is a little easier to follow, while Mohri is more complete.
It was helpful to me to have both, as details skipped over in one book's proof were often highlighted or better explained in the other's treatment of the same result. As someone whose training was not in theoretical computer science, I found that SSS left me with a better understanding most of the time.