I don't think computers would be able to understand aesthetics. It's a really high-level concept. Plus, I think deep learning is marketing mumbo jumbo and doesn't perform much better than a linear SVM.
Then why are we using deep convolutional networks for state-of-the-art vision and speech when we could just plug in an SVM with handcrafted features? From what I know, error rates in vision have dropped from 25% to less than 5% since deep learning came along. That's no trifle, especially at the higher end of the accuracy scale. It's very hard to conquer those last few percent.