The short answer: Nonlinearity isn't just important for deep neural networks - nonlinearity is deep neural networks. Without a nonlinear element in between linear layers, the "deep" is meaningless - a "deep linear network" is precisely equivalent in power to a simple, one-layer linear classifier[1]. (Because if all you're doing is a bunch of linear transformations, you can't do anything you couldn't do with a single linear transformation.)
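To make the collapse concrete, here is a minimal sketch in plain Python. The three weight matrices and the input vector are made-up values for illustration (integers, so the comparison is exact); they are not taken from the system under discussion. The sketch multiplies the three "layers" into a single matrix and checks that it gives the same output as applying them one after another, then shows that the equivalence breaks once a nonlinearity (ReLU, chosen here only as a familiar example) sits between the layers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def apply(w, x):
    """Apply matrix w to vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def relu(v):
    """Elementwise max(0, t), used only as an example nonlinearity."""
    return [max(0, t) for t in v]


# Hypothetical weights for a three-layer "deep linear network".
W1 = [[1, 2, 0], [0, 1, -1], [3, 0, 1], [1, 1, 1]]  # 3 inputs -> 4 units
W2 = [[1, 0, 2, -1], [0, 1, 1, 0]]                  # 4 units -> 2 units
W3 = [[2, 1], [-1, 3], [0, 1]]                      # 2 units -> 3 outputs

# Collapse the whole stack into one matrix: W = W3 * W2 * W1.
W = matmul(W3, matmul(W2, W1))

x = [5, -2, 7]  # arbitrary input

deep = apply(W3, apply(W2, apply(W1, x)))   # layer by layer
shallow = apply(W, x)                       # single equivalent layer

# Same weights, but with a nonlinearity between the layers.
deep_relu = apply(W3, relu(apply(W2, relu(apply(W1, x)))))

print(deep == shallow)    # the linear stack and the single matrix agree
print(deep_relu == deep)  # the nonlinear stack is not reproduced by W
```

The first comparison holds for every input, because matrix multiplication is associative. The second fails here because ReLU zeroes the negative intermediate value, which no fixed matrix W can reproduce for all inputs.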
As far as I can tell, your understanding that this is just a linear function is precisely correct - which means it can't do anything that a simple linear classifier can't.
[1] I suspect this has multiple layers because the physical constraints of the system prevent a single layer from being an arbitrary linear function of the inputs. The light from a specific pixel can only get effectively diffracted so far, so they need to cascade multiple layers to make sure that all the inputs can contribute to all the outputs. It still ends up being equivalent to a single linear transformation.
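A toy model of that footnote, sketched in plain Python under loud assumptions: each "layer" is an n x n matrix in which input i can only reach outputs within `reach` positions of i, and every allowed weight is simply 1. Those numbers are invented for illustration and say nothing about the real optics (actual diffraction weights are complex-valued and could cancel). The point is only that a product of limited-reach matrices reaches further with every layer, while remaining one ordinary matrix.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def banded(n, reach):
    """n x n matrix whose entry (i, j) is nonzero only if |i - j| <= reach."""
    return [[1 if abs(i - j) <= reach else 0 for j in range(n)]
            for i in range(n)]


def bandwidth(m):
    """Largest |i - j| over the nonzero entries of m."""
    return max(abs(i - j)
               for i, row in enumerate(m)
               for j, v in enumerate(row) if v != 0)


n = 8                  # hypothetical number of pixels
layer = banded(n, 1)   # each layer only mixes immediate neighbours

# Stack identical layers until every input can influence every output.
product = layer
reaches = [bandwidth(product)]
while bandwidth(product) < n - 1:
    product = matmul(layer, product)
    reaches.append(bandwidth(product))

print(len(reaches))  # number of layers needed in this toy setup
print(reaches)       # how far one input reaches after each layer
```

In this setup the reach grows by one position per layer, so connecting all 8 inputs to all 8 outputs takes 7 layers, and the result `product` is still just a single 8 x 8 matrix.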
True, but all the nonlinear optical effects I'm aware of only really start to matter at very high intensities - so they wouldn't really be applicable to the kinds of scenarios they envision, like directly feeding it images seen from ambient light.
Uhm, speed-of-light differences in a modified crystal lattice are constant nonlinearities that are reasonable to produce. They do not need high-intensity light, but they would need additional circuitry for scaling. Plus, the network would have to work on phase angle and not magnitude. Mostly the Kerr effect (high voltage) and cross wave polarization (e.g. using a Pockels cell) are useful there.