In principle a shallow NN (1 hidden layer) can approximate any continuous function, given enough hidden units. But it has a tendency to overfit and just "memorize" the inputs. The basic idea of adding additional layers is that the early layers can learn very low-level features of the data, and later layers combine the low-level features into higher-level features. This tends to make the models generalize well.
A standard example is a face detection algorithm. The first layer will do edge detection, the next layer will combine edges into corners and simple shapes, the next layer will maybe use those shapes to look for features like eyes, noses, mouths, etc., and then the next layer will maybe combine those features to look for a whole face.
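To make the "edges, then corners" step concrete, here is a toy sketch with hand-set weights. Nothing is learned here, and the 6x6 image, the two edge detectors, and the corner unit are all made up for illustration. A trained network would not necessarily end up with features this tidy.

```python
def relu(z):
    return max(0, z)

# Hypothetical 6x6 image: dark background, bright square in the lower right.
SIZE = 6
image = [[1 if r >= 2 and c >= 2 else 0 for c in range(SIZE)]
         for r in range(SIZE)]

# Layer 1: two hand-set edge detectors.
# vertical_edge[r][c] fires where brightness rises from column c to c + 1.
vertical_edge = [[relu(image[r][c + 1] - image[r][c]) for c in range(SIZE - 1)]
                 for r in range(SIZE)]
# horizontal_edge[r][c] fires where brightness rises from row r to r + 1.
horizontal_edge = [[relu(image[r + 1][c] - image[r][c]) for c in range(SIZE)]
                   for r in range(SIZE - 1)]

# Layer 2: a corner unit fires only when both neighbouring edge units fire.
# relu(a + b - 1) behaves like AND when a and b are each 0 or 1.
corner = [[relu(vertical_edge[r + 1][c] + horizontal_edge[r][c + 1] - 1)
           for c in range(SIZE - 1)]
          for r in range(SIZE - 1)]

corner_positions = [(r, c)
                    for r in range(SIZE - 1)
                    for c in range(SIZE - 1)
                    if corner[r][c] > 0]
print(corner_positions)  # [(1, 1)]
```

The second layer never looks at raw pixels, only at the edge maps, which is the sense in which later layers build on earlier ones.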
I am no expert, but I think it allows a higher-order function to be learned. An example would be a simple net where the output is just a linear combination of the input features. This would be extremely shallow, and while it will work for some things, there are going to be some instances where it doesn't capture nuanced scenarios.
In a shallow net, maybe college student selection based on SAT scores gets a heavy weight/low threshold/whatever. In a shallow linear combo, this will likely always carry a large weight.
In a deeper net, it might be able to learn that SATs are a great predictor except when X, Y, Z or some combination of those take some particular value, in which case they might be wholly irrelevant. The deeper it is, the longer it will take to train, but the more it can handle exception cases/trends and approximate reality.
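A rough sketch of the SAT point, under made-up assumptions: `sat` is a score scaled to [0, 1], `flag` is a hypothetical 0/1 condition standing in for "X Y Z", and the target is "score equals `sat`, unless `flag` is set, in which case SAT is ignored". Here "shallow" means no hidden layer at all, i.e. a pure linear combination. The weights are picked by hand, not trained.

```python
def relu(z):
    return max(0.0, z)

def target(sat, flag):
    # SAT matters, except when the flag is set; then it is irrelevant.
    return sat if flag == 0 else 0.0

def linear_score(sat, flag, w_sat=0.5, w_flag=-0.25, bias=0.125):
    # No hidden layer: SAT always carries the same weight.
    return w_sat * sat + w_flag * flag + bias

def hidden_layer_score(sat, flag):
    # One hidden ReLU unit: for sat in [0, 1], the flag switches SAT off.
    return relu(sat - flag)

def interaction_gap(f):
    # f(0,0) + f(1,1) - f(1,0) - f(0,1) is zero for every linear model,
    # whatever its weights, so a nonzero gap cannot be fit linearly.
    return f(0.0, 0) + f(1.0, 1) - f(1.0, 0) - f(0.0, 1)

corners = [(0.0, 0), (1.0, 0), (0.0, 1), (1.0, 1)]
net_matches_target = all(hidden_layer_score(s, f) == target(s, f)
                         for s, f in corners)

print(interaction_gap(target))        # -1.0
print(interaction_gap(linear_score))  # 0.0
print(net_matches_target)             # True
```

Changing the linear weights moves the fit around but the gap stays at zero, so the "except when" rule is out of reach without at least one nonlinear layer.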
No one really knows. There was a paper by Max Tegmark on HN yesterday with some new ideas and results, but I haven't had time to read it yet. http://arxiv.org/abs/1608.08225
The other responses to your question in this thread are as good as I could give you, but I would feel like I am recounting ideas that may be true but for which there is little evidence.
In general, it allows a better approximation of the solution function with far fewer hidden neurons. Sure, you could get arbitrarily close using a single hidden layer, but that hidden layer might need to be unfathomably large. Same idea for network topology in multilayer nets - a network could eventually learn to set a lot of the weights to zero, but training is a lot faster and more effective if you know a good problem-specific topology to start with. Deep nets make problems more tractable. Recurrence is the real game-changer, since then you've moved from non-linear function approximators up to Turing completeness (at least over the set of all possible RNNs).
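One constructed example of the "far fewer neurons" point, as a sketch rather than a general proof. A "tent" function on [0, 1] can be written with two ReLU units, and stacking it `depth` times gives a sawtooth with 2^depth straight-line pieces. A single-hidden-layer ReLU net with n units on a scalar input has at most n + 1 pieces, so matching the sawtooth exactly takes at least 2^depth - 1 units in one layer. The function names below are made up for this example.

```python
def relu(z):
    return max(0.0, z)

def tent(x):
    # Two ReLU units: rises 0 -> 1 on [0, 0.5], falls 1 -> 0 on [0.5, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_tent(x, depth):
    # A deep net with `depth` layers of two ReLU units each.
    for _ in range(depth):
        x = tent(x)
    return x

def count_linear_pieces(f, steps=1024):
    # Count slope changes on a dyadic grid of [0, 1]; the arithmetic is
    # exact here because every grid point and breakpoint is a dyadic number.
    ys = [f(i / steps) for i in range(steps + 1)]
    slopes = [(ys[i + 1] - ys[i]) * steps for i in range(steps)]
    return 1 + sum(1 for a, b in zip(slopes, slopes[1:]) if a != b)

for depth in range(1, 6):
    pieces = count_linear_pieces(lambda x: deep_tent(x, depth))
    deep_units = 2 * depth          # two ReLU units per layer
    shallow_units = pieces - 1      # lower bound for one hidden layer
    print(depth, pieces, deep_units, shallow_units)
```

The deep unit count grows linearly with depth while the one-layer lower bound roughly doubles each time, which is the flavor of "unfathomably large" for this particular function.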
Think of each layer as an opportunity to perform a level of abstraction or categorization. Concepts are built out of smaller concepts which are built out of smaller concepts; lots of layers allows lots of hierarchy in the concepts.
The first layer might recognize and respond to pixels in particular parts of an image, the next layer will group certain of those pixel responses together into an abstraction you might call a "line", the next layer will respond to certain groupings of lines and add a level of wiggle-room regarding where the lines are in the image, and the final layer will judge whether a combination of groupings constitutes the letter "A". Or at least, if you spent a bit of time poking at a deep network, giving it slightly different inputs, you might eventually conclude that this is what the layers were doing.
Without layers, you're basically just approximating a simple function or mapping with one level of abstraction.
I wrote a more detailed answer here:
http://stats.stackexchange.com/questions/222883/why-are-neur...