The discussion of Zoom caused me to look into what FaceTime does. AFAICT, there is no protection against MITM attacks by Apple against FaceTime. So in the end we have to trust Apple much like we have to trust Zoom. Of course one of those parties might be a lot more trustworthy than the other, but the point still stands. For the sort of protection that e2e can provide, the user has to have the ability to verify who they are talking to. Both Apple and Zoom are misleading by omission, one more than the other.
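To make "verify who they are talking to" concrete, here is a minimal sketch of an out-of-band fingerprint check. It does not describe what FaceTime or Zoom actually do; the `fingerprint` helper and the key bytes are hypothetical stand-ins, and SHA-256 is just an assumed choice of hash.

```python
import hashlib
import hmac

def fingerprint(public_key: bytes) -> str:
    """Short, human-readable digest of a public key (hypothetical helper)."""
    digest = hashlib.sha256(public_key).hexdigest()
    # Show the first 32 hex characters in groups of 4 so they can be read aloud.
    return " ".join(digest[i:i + 4] for i in range(0, 32, 4))

def same_key(shown_to_me: str, read_out_by_peer: str) -> bool:
    """Compare the fingerprint my client shows with the one my peer reads out."""
    return hmac.compare_digest(shown_to_me, read_out_by_peer)

# Placeholder byte strings standing in for real public keys.
bobs_real_key = b"bob-public-key"
key_the_server_gave_alice = b"mitm-public-key"  # swapped by a malicious server

print(same_key(fingerprint(bobs_real_key), fingerprint(bobs_real_key)))              # True
print(same_key(fingerprint(key_the_server_gave_alice), fingerprint(bobs_real_key)))  # False
```

If the service hands out the keys and the users have no way to run a comparison like this over some other channel, a key swap by the service goes unnoticed.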
This article is rather badly written: it puts the most important part in the second-to-last sentence, almost as an afterthought.
Plus, the Caesar cipher is a bad example, since it's so weak that you can actually compress the ciphertext using standard lossless compression algorithms.
I agree there are complications that e2e adds that make it pretty much infeasible for something like Zoom. But why is client-side compression not a solution to the problem statement here?
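A quick way to see the compression point, as a rough sketch: a Caesar shift preserves every repetition in the plaintext, so zlib shrinks the ciphertext almost as well as the original, while random bytes (used here as an assumed stand-in for the output of a strong cipher) do not shrink at all. The sample text and the shift of 3 are arbitrary.

```python
import os
import string
import zlib

def caesar(text: str, shift: int) -> str:
    """Shift lowercase ASCII letters by `shift`; leave everything else alone."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            out.append(ch)
    return "".join(out)

plaintext = "the quick brown fox jumps over the lazy dog " * 50
caesar_bytes = caesar(plaintext, 3).encode("ascii")
# Random bytes as a stand-in for what a strong cipher's output looks like.
random_bytes = os.urandom(len(caesar_bytes))

caesar_compressed = len(zlib.compress(caesar_bytes))
random_compressed = len(zlib.compress(random_bytes))

print(f"caesar: {len(caesar_bytes)} -> {caesar_compressed} bytes")
print(f"random: {len(random_bytes)} -> {random_compressed} bytes")
```

The Caesar ciphertext collapses to a small fraction of its size, while the random bytes come out slightly larger than they went in.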
Assume you are on a call with three other people. Alice has a solid connection and you can deliver best quality video to her. Bob is on a low bandwidth connection from home, so you need a better compression ratio for him because he can't handle full bandwidth. Carol is on a shit mobile connection that has high latency, low bandwidth, and drops packets. For each of these you would want a different encoding scheme and possibly very different packet sizes. But if you are sending it all from your desktop at home, you would basically be sending almost 3X the amount of data to handle the three different streams (plus the additional CPU for all of the different compression steps). Zoom instead sends one stream to their servers, which then re-broadcast it at different compression levels and with different packet schemes to give the best result for each user.
Justifiable, but in no way, shape, or form is this E2E encrypted, and any company that makes this claim is committing fraud.
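Back-of-the-envelope version of the upload cost, as a sketch only. The per-receiver bitrates are made-up numbers, not anything Zoom publishes, and both function names are hypothetical.

```python
# Hypothetical target bitrates per receiver, in kbit/s (illustrative only).
targets_kbps = {"Alice": 2500, "Bob": 1200, "Carol": 400}

def client_side_upload(targets: dict) -> int:
    """Sender encodes and uploads a separate stream for every receiver."""
    return sum(targets.values())

def server_side_upload(targets: dict) -> int:
    """Sender uploads one top-quality stream; the server re-encodes per receiver."""
    return max(targets.values())

direct = client_side_upload(targets_kbps)
relayed = server_side_upload(targets_kbps)

print(f"client-side encoding: {direct} kbit/s up")
print(f"via server:           {relayed} kbit/s up")
print(f"ratio:                {direct / relayed:.2f}x")
```

With these particular numbers the ratio is well under 3X; it approaches 3X as the three receivers' bitrates get closer to each other, and the sender pays for three encodes either way.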
The tl;dr of "Zoom E2E Encryption Explained Like You're 5" is rather simple: "It isn't E2E encrypted"
"Explained Like You're 5 and don't care about the way things actually work".
Read about Scalable Video Codecs (H.264 SVC, HEVC, VP9, AV1) and SFU vs MCU architectures, and then try again. The real reason end-to-end encryption is hard with SFU-mediated multiparty is very different from what is being described.
Also, there are ways to achieve this while still being usable. Apple's FaceTime does support E2E encryption for video calls.