As the name implies, LibriTTS is meant for people (researchers, really) to develop text-to-speech (TTS) systems.
The issue isn't whether LibriSpeech is clean enough. Rather, it's that conversational speech is very different from read speech (which is based on written text). Many aspects of conversation leave their mark on the signal: utterance planning (think: hesitations, "uhm", "uh"), turn-taking behavior (think: how speakers negotiate taking turns), self-correction, phonetic convergence (think: speakers adapting their speech to be more similar to that of their interlocutor), and so on.
The Common Voice data won't help with that, as it's read speech. It's far more expensive to collect conversational speech datasets, as transcription (or correction of automatic transcripts) involves a lot of manual labor.