As a thought experiment, such content would likely be encrypted. The request size can give away the content type, but speech-to-text could be done on device, making it harder to guess the contents from request size (assuming the recognized text, once compressed, is significantly smaller than the audio).
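To put rough numbers on the size gap, here is a back-of-envelope sketch. Every figure in it is an assumption picked for illustration (16 kHz 16-bit mono capture, a speech codec at about 16 kbit/s, about 150 words per minute at about 6 bytes per word), and `estimate_sizes` is a made-up helper, not anything a real app is known to do.

```python
def estimate_sizes(seconds):
    """Rough payload sizes in bytes for `seconds` of speech (all rates assumed)."""
    # Assumption: raw capture at 16,000 samples/s, 2 bytes per sample, mono.
    raw_audio = 16_000 * 2 * seconds
    # Assumption: speech codec running at roughly 16 kbit/s (2,000 bytes/s).
    compressed_audio = (16_000 // 8) * seconds
    # Assumption: ~150 words per minute, ~6 bytes per word including the space.
    transcript = 150 * 6 * seconds // 60
    return {
        "raw_audio": raw_audio,
        "compressed_audio": compressed_audio,
        "transcript": transcript,
    }


sizes = estimate_sizes(60)
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
print(f"compressed audio is ~{sizes['compressed_audio'] / sizes['transcript']:.0f}x the transcript")
```

Under those assumptions a minute of transcribed speech is a few hundred bytes, small enough to disappear inside ordinary request traffic, which is the whole problem with relying on request size.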
Then correlate speaking with CPU usage. Processing audio is always going to have /some/ cost /somewhere/, and likely one we can detect for the time being.
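A minimal sketch of what "correlate speaking with CPU usage" could look like, assuming you have already logged, once per second, whether someone was talking and how much CPU the suspect process used. The sample values below are invented, and `pearson` is a plain textbook correlation, not a claim about how anyone actually tests for this.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Hypothetical log, one sample per second: 1 = someone was speaking, 0 = silence.
speaking = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
# Hypothetical CPU usage (percent) of the suspect process over the same seconds.
cpu_percent = [2.0, 2.1, 9.5, 10.2, 9.8, 2.3, 1.9, 10.1, 9.7, 2.2]

r = pearson(speaking, cpu_percent)
print(f"correlation between speech and CPU load: {r:.2f}")
```

A value near 1 only says the process gets busy when people talk. It doesn't say what the process does with the audio, so it's a reason to look closer, not proof.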
If both of those are true then your concern is of the form "ANY party records audio data and sells it".
Which is why we don't want random apps having permanent mic access. Or permanent anything access. This is why data mining is bad: not just because the party doing the mining can get the data, but because they can sell it to third parties who combine it in unexpected ways to leak data that you really don't want to be public.
Totally agree. Although now might be a good time to raise the issue that "an obfuscated single-line mention buried in a 60-page clickwrap license presented in an 80x4 character window resulting in over 9000 pages of bullshit that you can't possibly realistically read does not 'consent' make".