To your second point, I have a sneaking suspicion that whatever is recommended in this very thread will suddenly jump in estimation as a “classic.” History is made up as it goes along!
Well, GP's Neural Smithing is a solid example. There is nothing wrong with it; it is surprisingly well written and correct for something published before the millennium.
Take a look at the Google Books preview (click view sample). The basics are all there: an intro to the biological history of neural networks, backpropagation, gradient descent, partial derivatives, etc. It even hints at teacher-student methods!
The only issue is that it missed out on two decades of hardware development (and a bag of other optimization tricks). Modern deep learning implementations require machine sympathy at scale. It also doesn't cover autoregressive networks like RNNs or image processing tricks like CNNs.
https://books.google.com/books/about/Neural_Smithing.html?id...