Hacker News
by zbobet2012 1016 days ago
This is interesting, but it's key to remember that audio and video codecs have three primary constraints in their design: compression efficiency, encode complexity, and decode complexity.

Simply addressing compression efficiency without considering the other constraints makes for an impractical codec.

If your encode complexity is too high, then you can't handle real-time (live) use cases, and your compute requirements may rule out large libraries like those of social media platforms.

If your decode complexity is too high, mobile devices will suffer from severe battery drain or simply won't be able to decode, even in hardware. The codec may also be too complex to implement in dedicated hardware in a cost-effective way.
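
To make the encode constraint above concrete, here is a minimal sketch of the usual "real-time factor" check. The function names and the 0.8 headroom margin are assumptions for illustration, not taken from any particular encoder.

```python
def real_time_factor(encode_seconds: float, media_seconds: float) -> float:
    """Wall-clock encode time divided by media duration (1.0 = exactly real time)."""
    return encode_seconds / media_seconds


def can_encode_live(rtf: float, headroom: float = 0.8) -> bool:
    """Live encoding needs rtf below 1.0; the headroom margin is an assumed safety factor."""
    return rtf <= headroom


def library_cpu_hours(library_hours: float, rtf: float) -> float:
    """Compute time needed to encode a back catalogue once at a given real-time factor."""
    return library_hours * rtf


# Hypothetical numbers: a codec that takes 4 s to encode 1 s of video.
slow = real_time_factor(encode_seconds=240.0, media_seconds=60.0)
print(can_encode_live(slow))            # not usable for live
print(library_cpu_hours(1000.0, slow))  # still usable offline, at a compute cost
```

A factor above 1.0 rules out live use outright, while for stored libraries it only scales the compute bill.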

2 comments

It can be worth deprioritizing encode complexity, since the majority of audio and video streaming is non-live content. YouTube, Netflix, TikTok, Instagram, and so on can all benefit even if encode is slower than realtime. Obviously there are cost-benefit considerations here, but they are already deploying AV1, so they are willing to accept some hit on compute costs in exchange for lower bandwidth costs.
A lot of that footage originates on smartphones, GoPros, drone cameras, etc., where hardware and power limitations do not allow one to run expensive encoding algorithms.
So the smartphone does a crummy H.264 encode that's bit-expensive but power-efficient, then the YouTube server saves the bits a million times over by transcoding to AV1. There's still room for at least 2 codecs on the Pareto curve.
Yes, exactly.
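
The "encode cheaply on the device, transcode once on the server" trade-off in this sub-thread comes down to a break-even count of views. A rough sketch follows; every number and parameter name is hypothetical, not a real price or a measured AV1 saving.

```python
def break_even_views(upload_bytes: float,
                     bitrate_saving: float,
                     transcode_cost: float,
                     egress_cost_per_byte: float) -> float:
    """Views needed before a one-off transcode pays for itself in bandwidth.

    bitrate_saving is the fraction of bytes the server-side codec removes
    relative to the device's original encode (e.g. 0.3 for 30%).
    """
    if not 0.0 < bitrate_saving < 1.0:
        raise ValueError("bitrate_saving must be between 0 and 1")
    saved_per_view = upload_bytes * bitrate_saving * egress_cost_per_byte
    return transcode_cost / saved_per_view


# Assumed inputs: a 100 MB upload, 30% smaller after transcoding,
# $0.50 of compute per transcode, $0.02 per GB of egress.
views = break_even_views(
    upload_bytes=100e6,
    bitrate_saving=0.3,
    transcode_cost=0.50,
    egress_cost_per_byte=0.02 / 1e9,
)
print(round(views))  # roughly 833 views under these made-up numbers
```

Under assumptions like these, a clip watched a million times repays the transcode many times over, while a clip watched a handful of times never does, which is why more than one codec can sit on the Pareto curve.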
Lyra is a real-time neural speech codec from Google - I don't know if they use it in the Pixel line for call compression, but they certainly could.

Interestingly, I had the idea of using their open-source version as a vocoder for a lightweight TTS model. It did work, in that it produced intelligible speech, but with very rough audio quality on the validation set. No matter what I tweaked, after 1-2 epochs the validation error would always diverge from the training error, which to me suggests considerable redundancy in the compressed representations (i.e., two clips of perceptually similar audio can encode to different representations, so the TTS model has difficulty learning the underlying loss surface). I suspect there's still a lot more entropy to be squeezed out of it. The EnCodec authors encountered something similar, compressing their codec by a further 40% by simply layering a language model over the top.
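
As a toy illustration of why a model layered over codec tokens can squeeze out more bits: if the tokens are predictable from context, the conditional entropy is lower than the per-token entropy. The sketch below uses a simple previous-token (bigram) count model on a made-up token stream. It is not what EnCodec or Lyra actually do, and in-sample estimates like this are optimistic.

```python
import math
from collections import Counter


def order0_bits(tokens):
    """Empirical entropy in bits per token, ignoring context."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def order1_bits(tokens):
    """Empirical conditional entropy in bits per token, given the previous token."""
    pairs = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])
    n = len(tokens) - 1
    return -sum(
        c / n * math.log2(c / contexts[prev])
        for (prev, _), c in pairs.items()
    )


# Hypothetical codec output: four symbols used equally often, but in a fixed cycle.
tokens = [0, 1, 2, 3] * 100
print(order0_bits(tokens))  # 2 bits per token if each token is coded independently
print(order1_bits(tokens))  # essentially 0 bits once the previous token is known
```

The gap between the two numbers is the redundancy a context model could remove with an entropy coder; real codec tokens are far less predictable than this cycle, so the achievable saving is much smaller.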