| I think an aspect of deep learning that is often overlooked is that it is still not clear how much of current algorithm performance is driven by local "obsession with detail" vs. global "understanding" of the subject matter. The "tiramisu" layers here are an interesting example of this: they are built on dilated convolutions, and one of their main selling points is that they can do calculations on a pixel-by-pixel basis, as opposed to standard methods, which are forced to compress information along the way through pooling/strided convolutions (basically taking multiple pixels and summarizing them into fewer features).
Even Wavenet, which has had a few posts on HN, is in some sense a compromise: a few years ago people were obsessed with the idea of forcing RNNs/LSTMs to summarize the inputs they've seen to date and to learn long-range dependencies through a hidden layer that would hopefully be interpretable. Mostly, though, models seem to be very happy staring at the last few inputs: recent work showed that they seem to work mostly as n-gram models with relatively small n [0, 1]. The compromise is Wavenet, which can only act on a context of about 300 ms at a time [2]. That more or less precludes learning long-range structure, but Wavenet doubles down on this inductive bias and runs this tiny audio context through many layers and tens of millions of parameters to outdo state-of-the-art models that need to "lose information" as they process it.
To your point, I would argue that most real-world applications are more interested in "global" interactions and an ability to "understand" signals than in expending tremendous resources on every tiny detail that is observed. I'm not sure that this is the typical solution neural networks are going to converge to. Partly I think this is motivated by hardware: GPUs are unbelievably powerful computing machines and they make convolutions look unbelievably attractive.
Some researchers have 8 or more of them, so they don't have to worry about the cost of obsessing over detail.
The other part of the tiramisu models, DenseNets, basically glue together layer after deep layer after deep layer... I think it's an obvious idea for an architecture, but from my understanding of GPUs, layer concatenation is an expensive operation, and people wouldn't have bothered designing the architecture a few years ago because they wouldn't have been able to run it on anything other than a Titan X from the future.
Probabilistic models have been floated as a way to increase the global coherence of the information models are learning. In my experience, however, when trying to combine models with probabilistic and convolutional components (like [3]), the neural network's first-order optimization promotes obsession over details rather than understanding the data well enough to handle any uncertainty. To some degree I think this is also what we see in the deep learning community: why do second-order optimization and move toward new paradigms when you have 8 GPUs and can get a 0.1% improvement on the latest image benchmark?
[0] Blog post on "Frustratingly Short Attention": https://martiansideofthemoon.github.io/2017/06/28/short-atte...
[1] From https://arxiv.org/pdf/1703.08864.pdf : "It shows that the memory acquired by complex LSTM models on language tasks does correlate strongly with simple weighted bags-of-words. This demystifies the abilities of the LSTM model to a degree: while some authors have suggested that the LSTM understands the language and even the thoughts being expressed in sentences (Choudhury, 2015), it is arguable whether this could be said about a model that performs equally well and is based on representations that are essentially equivalent to a bag of words."
[2] https://arxiv.org/pdf/1609.03499.pdf
[3] "PixelVAE": https://arxiv.org/pdf/1611.05013.pdf |
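
For anyone who wants to sanity-check the "about 300 ms" context figure quoted above for Wavenet, here is a rough back-of-the-envelope sketch of how the receptive field of a stack of dilated convolutions is computed. The specific numbers are my assumptions, not something stated in the comment: kernel size 2, dilations doubling from 1 to 512, that block repeated 5 times, and 16 kHz audio. The function names are made up for illustration.

```python
def receptive_field(dilations, kernel_size=2):
    """Receptive field, in samples, of a stack of stride-1 dilated convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d samples.
    """
    return 1 + (kernel_size - 1) * sum(dilations)


def receptive_field_ms(dilations, sample_rate_hz, kernel_size=2):
    """Same quantity expressed in milliseconds of audio."""
    return 1000.0 * receptive_field(dilations, kernel_size) / sample_rate_hz


# Assumed (hypothetical) configuration: dilations 1, 2, 4, ..., 512, repeated 5 times.
block = [2 ** i for i in range(10)]
dilations = block * 5

samples = receptive_field(dilations)                # 5116 samples
millis = receptive_field_ms(dilations, 16_000)      # 319.75 ms at 16 kHz
print(f"{samples} samples, {millis:.2f} ms")
```

Under these assumptions the context comes out at roughly a third of a second, in the same ballpark as the figure above. The field grows only linearly with the number of repeated blocks, so reaching several seconds of context this way would need an order of magnitude more layers, which is the sense in which the architecture trades long-range structure for detail.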