I don't think these systems' purpose is to fool humans. Tasks that test whether a system can fool people are simply a good way to evaluate the performance of a system. If a speech synthesizer fools people into thinking a real person is speaking, that means the speech synthesis is really good. You might say it's not important that synthesized speech sounds perfectly human, but our speech perception evolved to be optimal for human speech, so it's likely that any deviation from that makes the signal harder to process.
Because with deep learning we quite recently got a new tool (someone figured out how to use GPUs for training) that lets us do a lot of those imitation tasks we couldn't do before.