I first saw this type of concatenation in Alex Graves' work on "Generating Sequences With Recurrent Neural Networks", including his unpublished TTS demo [1]. Biasing with part of another sentence (as he did for handwriting) could plausibly improve style in TTS as well.
We followed this approach in char2wav [2], but in my opinion "voice cloning" has come much further since [3][4][5]. There's a lot of relevant research on techniques that go beyond concatenating indicators or embeddings, if people are interested in the research side of this technology.
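
For anyone unsure what "concatenating indicators or embeddings" means in practice, here is a minimal sketch in plain Python. It is not the code from any of the cited work; the function names (`one_hot`, `concat_condition`), the toy feature sizes, and the speaker count are all made up for illustration. The idea is only that the same conditioning vector gets appended to the input features at every timestep, so the network sees "who is speaking" alongside "what is being said".

```python
from typing import List


def one_hot(index: int, size: int) -> List[float]:
    """Build an indicator vector with a single 1.0 at `index` (hypothetical helper)."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    return [1.0 if i == index else 0.0 for i in range(size)]


def concat_condition(frames: List[List[float]], condition: List[float]) -> List[List[float]]:
    """Append the same conditioning vector to every timestep's feature vector."""
    return [frame + condition for frame in frames]


# Toy data (assumed): 3 timesteps of 2-dim text features, 4 known speakers.
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
indicator = one_hot(2, 4)  # speaker number 2 out of 4
conditioned = concat_condition(frames, indicator)

# Each timestep now has 2 + 4 = 6 features.
print(len(conditioned), len(conditioned[0]))
```

Swapping the one-hot indicator for a learned embedding vector changes nothing structurally: `condition` just becomes a dense vector instead of a sparse one.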