Hacker News
by mox1 2272 days ago
How do you E2E encrypt a video stream and still allow adaptive bit rates?

If the server can't read (decrypt) the video, it cannot re-encode the video at different bitrates for different clients.

Or the Zoom client has to encode multiple streams locally and upload them all... or it just downgrades to the bitrate of the slowest client...

You get shitty video and E2E encryption or good video and transport encryption.

First of all, there are 2 different kinds of video calls: 1:1 and group calls. For 1:1 calls, e2e encryption doesn't cause any problem at all.

For group calls, it depends on how it's implemented, but many group calls are implemented using what's called a Selective Forwarding Unit (SFU) and the sending clients send multiple resolutions (either independent, called "Simulcast" or dependent, called "SVC"). In that case, the adaptation is done by the server in selecting which resolution to forward at any given time. This is fairly common practice in the industry. For example: https://github.com/jitsi/jitsi-videobridge and https://tools.ietf.org/html/draft-aboba-avtcore-sfu-rtp-00 and https://www.w3.org/TR/webrtc-svc/.

For those types of group calls, the server only needs to know the sizes of the various streams and which packet is for what stream. It does not need to see the decrypted media, so one can implement e2e encryption for such types of group calls. This is less common in the industry, but is possible. For example: https://support.google.com/duo/answer/9280240?hl=en
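
To make that concrete, here is a minimal Python sketch of the forwarding decision. The `Packet` fields and the function names are made up for illustration and are not taken from Jitsi, Duo, or any real SFU. The point is only that the server reads cleartext layer metadata and never touches the encrypted payload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    layer: int         # cleartext: which simulcast layer this packet belongs to
    kbps: int          # cleartext: nominal bitrate of that layer
    ciphertext: bytes  # end-to-end encrypted media, opaque to the server


def pick_layer(layer_kbps: dict[int, int], receiver_kbps: int) -> int:
    """Return the highest-bitrate layer that fits the receiver's bandwidth.

    Falls back to the lowest-bitrate layer if none fits.
    """
    fitting = [layer for layer, kbps in layer_kbps.items() if kbps <= receiver_kbps]
    if fitting:
        return max(fitting, key=lambda layer: layer_kbps[layer])
    return min(layer_kbps, key=lambda layer: layer_kbps[layer])


def forward(packets: list[Packet], receiver_kbps: int) -> list[Packet]:
    """Forward only the packets of the layer chosen for this receiver."""
    if not packets:
        return []
    layer_kbps = {p.layer: p.kbps for p in packets}
    chosen = pick_layer(layer_kbps, receiver_kbps)
    # The ciphertext is copied as-is; nothing here decrypts or re-encodes it.
    return [p for p in packets if p.layer == chosen]
```

A real SFU would also have to switch layers only at keyframes and rewrite sequence numbers, which this sketch ignores.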

(I used to work at Google on WebRTC, Duo, and Hangouts, but now work on video calling at Signal).

That would seem to require significantly more upload bandwidth & compression capacity from clients, which is often not broadly available to consumers. I guess you could drop down to the lowest resolution only when sending to the service if you have a bandwidth challenge, but that seems less than ideal.
Lots of video conferencing systems already work this way (the SFU way). Compared to just sending the full resolution all the time, adding the smaller resolutions doesn't add that much bandwidth and compression because they are so much smaller.
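
As a rough back-of-the-envelope check, assuming bitrate scales with pixel count (real encoders only approximate this, and small layers tend to spend more bits per pixel), the extra upload for two smaller layers can be estimated like this. The resolutions are example values, not anything a particular product uses.

```python
def relative_upload(layers: list[tuple[int, int]]) -> float:
    """Total upload relative to sending only the largest layer.

    Assumes bitrate is proportional to pixel count, which is only a
    rough approximation for real encoders.
    """
    pixels = [width * height for width, height in layers]
    return sum(pixels) / max(pixels)


# 720p plus 360p and 180p simulcast layers.
print(relative_upload([(1280, 720), (640, 360), (320, 180)]))  # prints 1.3125
```

Under that assumption the two extra layers cost about 31% more upload than the full-resolution stream alone.
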
Actually I'm calling BS on this. Zoom is literally adjusting frame rates in steps of a single frame per second. There is definitely upload pressure. SFU isn't going to be enough.
But a lot of video conferencing systems are designed for office environments, where you tend to have symmetric bandwidth.
I used to work in video and, if I remember correctly, there were I, P and B frames. You need I and P, but the B frames are optional. So if some metadata is unencrypted, the server can tell which packets are B frames and decide not to send them to slow clients. The actual data is still encrypted.
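
A minimal sketch of that idea in Python, assuming each packet carries a cleartext frame-type tag next to the encrypted payload, and assuming no other frame uses a B frame as a reference (not guaranteed in every codec). `FramePacket` and `forward_frames` are hypothetical names, not any real protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FramePacket:
    frame_type: str    # cleartext metadata: "I", "P", or "B"
    ciphertext: bytes  # encrypted frame data, never inspected by the server


def forward_frames(packets: list[FramePacket], slow_receiver: bool) -> list[FramePacket]:
    """Drop B frames for slow receivers; forward everything otherwise."""
    if not slow_receiver:
        return list(packets)
    # Only the cleartext tag is read; the payload stays encrypted.
    return [p for p in packets if p.frame_type != "B"]
```

The cost is that the server learns the frame type and size of every packet, which is the metadata leak the scheme depends on.
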
Sure. Then maybe don't claim that the service is e2e-encrypted?
I'd probably have each client encode one high quality stream that's targeted to be accessible to 90+% of clients, and a very low quality stream that's 5% of the bitrate of the high quality one. Low encoding complexity and adds a negligible amount to your upload bandwidth requirements. (Obviously if a client can't meet the upload quota for the highest quality, you max out at whatever they can do.)
Note: what follows is probably not how anyone actually does it. It is just an illustration that adaptive video is not incompatible with E2E encryption.

Suppose you have a block of 4 pixels, represented by 4 24-bit values. Instead of sending the 4 pixel values, send one 24-bit value that is the average of all 4 pixel values, and then the actual 24-bit values for 3 of the 4 pixels. The receiver can figure out the 4th pixel from those 3 and the average.

Send the average values and the groups of 3 discrete pixel values in logically separate streams, separately encrypted.

If something transporting this needs to lower the bandwidth, it can just drop the E2E discrete pixel stream, leaving just the E2E average stream. The receiver can then use that average value for all 4 pixels, in effect getting a video that is 1/2 the resolution both horizontally and vertically.

This scheme only gives you two rates: full resolution and 1/2 x 1/2. No doubt you could do systems based on block sizes other than 2x2, and with multiple levels of averaging, that would give a wider range of fallbacks.
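
Here is a small Python sketch of the 2x2 scheme, with encryption left out so the layering is easy to see. One assumption is added here: the average is stored as a rounded-down integer per 8-bit channel, so the reconstructed fourth pixel can be off by up to 3 per channel. Sending the exact sum instead would make it lossless. All names are illustrative.

```python
Pixel = tuple[int, int, int]  # (R, G, B), 8 bits per channel


def encode_block(block: list[Pixel]) -> tuple[Pixel, list[Pixel]]:
    """Split a 2x2 block into (average pixel, first three pixels).

    The average is the base stream; the three pixels are the detail stream.
    """
    if len(block) != 4:
        raise ValueError("expected exactly 4 pixels")
    avg = tuple(sum(p[c] for p in block) // 4 for c in range(3))
    return avg, block[:3]


def decode_full(avg: Pixel, detail: list[Pixel]) -> list[Pixel]:
    """Rebuild all 4 pixels; the fourth is estimated from the average."""
    fourth = tuple(
        min(255, max(0, 4 * avg[c] - sum(p[c] for p in detail)))
        for c in range(3)
    )
    return list(detail) + [fourth]


def decode_low(avg: Pixel) -> list[Pixel]:
    """Detail stream was dropped: use the average for every pixel."""
    return [avg] * 4
```

In the E2E setting, `avg` and `detail` would each be encrypted separately by the sender, and the server would choose only whether to forward the detail stream.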

Actual state-of-the-art video encoding is, I believe, based on things like the discrete cosine transform, which represents an image as a sum of cosines of various frequencies.

In this kind of representation the higher frequencies correspond to higher resolution detail in the image. I'd expect that you could do an E2E transmission scheme where you have different encrypted streams for different frequency ranges. Like with my far less sophisticated or clever 2x2 averaging scheme above, you could simply drop the streams for higher frequencies and the receiver would be able to reconstruct a lower resolution image, but unlike my 2x2 averaging scheme this would have much finer drops in resolution.
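
A toy Python illustration of the frequency-band idea, using a one-dimensional DCT on a single row of samples. This is not how any real codec packetizes its data, and encryption is omitted. It only shows that a receiver can still reconstruct a coarser approximation when the high-frequency band is missing. The function names and the cutoff are arbitrary choices for the example.

```python
import math


def dct(samples: list[float]) -> list[float]:
    """Unnormalized DCT-II of a row of samples."""
    n = len(samples)
    return [
        sum(samples[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
        for k in range(n)
    ]


def idct(coeffs: list[float], n: int) -> list[float]:
    """Inverse of dct() for n samples.

    coeffs may hold fewer than n values (but at least one); the missing
    high-frequency coefficients are treated as zero.
    """
    return [
        (coeffs[0] / 2
         + sum(coeffs[k] * math.cos(math.pi / n * (i + 0.5) * k)
               for k in range(1, len(coeffs)))) * 2 / n
        for i in range(n)
    ]


def split_bands(coeffs: list[float], cutoff: int) -> tuple[list[float], list[float]]:
    """Split coefficients into a low band and a high band at index cutoff."""
    return coeffs[:cutoff], coeffs[cutoff:]
```

Each band would be encrypted as its own stream. A receiver that gets both calls `idct(low + high, n)`, and one that only gets the low band calls `idct(low, n)` and sees a smoother version of the row with the same average brightness.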

> Suppose you have a block of 4 pixels, represented by 4 24-bit values. Instead of sending the 4 pixel values, send one 24-bit value that is the average of all 4 pixel values, and then the actual 24-bit values for 3 of the 4 pixels.

So you still send 4*24 bits? What's the point?

> If something transporting this needs to lower the bandwidth, it can just drop the E2E discrete pixel stream, leaving just the E2E average stream. The receiver can then use that average value for all 4 pixels, in effect getting a video that is 1/2 the resolution both horizontally and vertically.

But you need knowledge of this protocol, so the sender is the only one able to do this. In that case just encode the downsampled resolution and send that, no tricks needed.

The way these video meeting services work is the participants all connect to the service's servers. Each participant sends their video feed to the server, which sends it on to the other participants in the meeting.

It's that server that wants to be able to dynamically downgrade outgoing feeds based on the bandwidth between it and the meeting participants, which can vary from participant to participant.

Alice, for example, might be on a symmetric gig fiber connection with consistent and low latency. Her client can send a high resolution feed to the server. Bob might have no trouble with receiving that, but Carol might be on a slower, less stable connection, and need a lower resolution version.

If you aren't trying to do E2E encryption, you can handle this by having the server deal with taking the high resolution feed from Alice and generating a low resolution feed and then sending the other participants whichever is the best version they can handle. That works because without E2E encryption the server actually has access to the video, so it can do things like resample and re-encode.

If you are using E2E, though, the only parties that should have access to the video itself are the meeting participants. The server should not have access to the video except in encrypted form.

The problem, then, is how to encode and encrypt a video stream so that a server copying it between a sender and one or more recipients can reduce the resolution of a copy, even though it has no access to the unencrypted video.