You can't really do it because of licensing reasons. One cool thing Common Voice brings to the table, besides all the fantastic data, is the licensing.
Which, it must be said, isn't always as bullet-proof as it could be. There are a not insignificant number of transcription (or pronunciation) errors in those datasets, and Mozilla might want to find ways to increase the quality of already-released data over time.
Fair use isn't a feature of copyright in every jurisdiction, which could make this a less than useful idea when trying to create a global corpus of speech data.
This is incorrect. Pretty much every state-of-the-art model uses copyrighted data. This is considered fair use, and it has never been a problem outside of concern trolling.
As a lot of that cc text is automatically generated, it seems like you’d just be creating a clone of other software, which might be an intellectual property issue.