by pal_9000 2251 days ago
Jitsi is on a roll! By the way, does anyone know what the challenging part of e2e in video chats is? Going on intuition, I'd expect keys to be exchanged during the handshake and the binary data to be decrypted on the clients. I'm just wondering how Zoom could miss it.
4 comments

Actually, WebRTC was designed from the start to be end to end encrypted as well as peer to peer. It was even designed so that you can't turn off encryption even if you wanted to. This choice was made during the early design period of WebRTC, which coincided with the Snowden revelations.

However, over time people who built WebRTC systems (like Jitsi or Zoom) realized that end to end encryption makes multi-party chats hard. The basic issue is that you don't want to burden end points with sending their video streams to all users. Think about tens or even thousands of users.
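
To put rough numbers on that burden, here is a back-of-the-envelope sketch in Python. It assumes a full mesh where every participant sends one stream to every other participant, versus a single upload to a forwarding server without simulcast. The function names are made up for illustration.

```python
def mesh_uploads_per_client(participants: int) -> int:
    """Full mesh: each client sends its stream to every other participant."""
    return max(participants - 1, 0)


def relay_uploads_per_client(participants: int) -> int:
    """Central forwarding server (assumed, no simulcast): one upload per client."""
    return 1 if participants > 1 else 0


for n in (2, 10, 1000):
    # Columns: participants, uploads per client in a mesh, uploads via a relay.
    print(n, mesh_uploads_per_client(n), relay_uploads_per_client(n))
```

With 1000 participants that is 999 outgoing copies per client in a mesh against one with a relay, which is the whole reason for the central box.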

So the way they used WebRTC changed. The mandatory end to end encryption was circumvented by connecting to a central server, so the encryption terminates there instead of at the other participants. This entity could then forward streams to as many people as you want. Also, most of the time people don't need an HD video of someone. A small thumbnail is enough; think of a screen filled with 10 thumbnails of your users. The downsampling of the video stream can happen on that central box. The downsampling was enabled with simulcast mode, but even then it still requires more bandwidth. Insertable streams, meanwhile, will enable WebRTC applications to put a second layer of encryption over the encryption provided by the user agent. That second layer can then reach to the actual other end of the communication, as key management is left to the application.

The sad twist in this story is that the desire to make the encryption very inflexible so that it's surely not circumvented made it impossible for people to amend it so that SFUs work... leading to people disabling it altogether.

> end to end encryption makes multi-party chats hard. The basic issue is that you don't want to burden end points with sending the video streams to all users. ... The mandatory end to end encryption was circumvented by connecting to a central server.

Dumb question: Can't you choose one video encryption key K, use ten thousand individual secure connections (with different keys) to share K with all the other users, then encrypt your video with K and let central servers mirror it all they like? (Could even have other clients do some mirroring, bittorrent-style.) Regarding downsampling: if the client has only enough CPU and bandwidth to put out one stream, then, yeah, that doesn't work very well, but otherwise you could put out multiple streams (all encrypted with K) of different quality.

So from what I understood, the clients had no way of accessing an encoded WebRTC video frame before it was sent over the network. Only with the new Insertable Streams is this possible. So they kind of plan to do what you say: encrypt it "manually" on the client and let the router mirror it. Sharing the key as you proposed still requires that you can connect p2p to all participants. Sadly that's not possible in all NAT situations, and you would still need a TURN server for the clients to meet, so you again have a central point.
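
For what it's worth, the "one key K, shared over individual secure connections" idea can be sketched in a few lines of Python. This is only a sketch under loud assumptions: the pairwise secrets are assumed to exist already, the names are invented, and the cipher is a toy built from `hashlib` and `hmac` so that it runs on the standard library. A real client would use a vetted AEAD such as AES-GCM, not this.

```python
import hashlib
import hmac
import secrets


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream (SHA-256 in counter mode). Illustration only, not for real use."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def seal(key: bytes, plaintext: bytes) -> bytes:
    """Encrypt-then-MAC with the toy cipher. Layout: nonce || ciphertext || tag."""
    enc_key = hashlib.sha256(b"enc" + key).digest()
    mac_key = hashlib.sha256(b"mac" + key).digest()
    nonce = secrets.token_bytes(16)
    stream = _keystream(enc_key, nonce, len(plaintext))
    ct = bytes(a ^ b for a, b in zip(plaintext, stream))
    tag = hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()
    return nonce + ct + tag


def open_sealed(key: bytes, blob: bytes) -> bytes:
    """Check the tag, then decrypt. Raises ValueError if the blob was altered."""
    enc_key = hashlib.sha256(b"enc" + key).digest()
    mac_key = hashlib.sha256(b"mac" + key).digest()
    nonce, ct, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication failed")
    return bytes(a ^ b for a, b in zip(ct, _keystream(enc_key, nonce, len(ct))))


# Assumption: the sender already shares a secret with each participant;
# how those pairwise channels are set up is out of scope here.
pairwise = {name: secrets.token_bytes(32) for name in ("bob", "carol")}

media_key = secrets.token_bytes(32)  # the single key K
wrapped = {name: seal(k, media_key) for name, k in pairwise.items()}

frame = b"encoded video frame"
encrypted_frame = seal(media_key, frame)  # the server can mirror this blob freely

# Each recipient unwraps K with its own pairwise key, then decrypts the frame.
for name, k in pairwise.items():
    k_received = open_sealed(k, wrapped[name])
    assert open_sealed(k_received, encrypted_frame) == frame
```

The server only ever handles `wrapped` and `encrypted_frame`, neither of which it can open without a pairwise secret.
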
Not a dumb question at all. That’s exactly how this would be sensibly designed. Does WebRTC’s built-in E2EE not do this?
The challenging bits are:

* How do you handle non-encrypted participants (e.g. people dialling in from the PSTN?)

* Up until this week, you haven't been able to intercept WebRTC streams for doing E2EE if running in-browser. (Zoom however doesn't use WebRTC, so they don't have this excuse).

* Do you have a safe place to store the keys, and manage user identity?

I cannot think how you could possibly do the key exchange securely and automatically, if you want to give a link to someone and have it "just work".

If all you have is the URL, then the server sees the encryption key.

Video conferencing also rarely has users register. So there isn't a way to validate users either. And even if they did register and users didn't care about the extra friction, multiple devices means either the server stores your private key, or you have many keys which is much harder to verify.

E2EE is much easier on phones, which is why Signal is so good. The identity is your phone number, and you can only have one key associated with your number. That key never leaves your device. Conceptually easy.

Video conferencing has none of those advantages, and I don't know how you would make it conceptually easy for users without reducing the security.

I don't know much about the broader context, but to this part:

> If all you have is the URL, then the server sees the encryption key.

Not necessarily. It's possible to put the key after a "#" in the URL, which allows client-side code to use it without sending it to the server. This technique is used at ZeroBin, among other places. (Edit: This is actually done in the video in the OP as well.)
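
A quick way to see the "#" trick, sketched in Python with the standard library. The invite link, room name and key are invented for the example; the point is only that the fragment is not part of what a client sends on the request line.

```python
from urllib.parse import urlsplit

# Hypothetical invite link; host, path and key are made up.
invite = "https://meet.example.org/room42#e2ekey=3q2-7w"

parts = urlsplit(invite)

# What an HTTP client puts on the request line: path plus optional query.
request_target = parts.path + (f"?{parts.query}" if parts.query else "")

# What stays on the client: everything after the "#".
key_material = parts.fragment.split("=", 1)[1]

print(request_target)  # /room42
print(key_material)    # 3q2-7w
```

The caveat is that the same server usually delivers the client-side code that reads the fragment, so you are still trusting that code not to send the key back.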

You could still make the phone a primary device and allow it to perform the key agreement and pass control off with a QR code, but that is complicated and leaves open the question of who is allowed in this conference.

So perhaps you just give up on persistent identity: just have an unencrypted waiting room, where the organizer and their delegates can approve people to enter the encrypted conference.

Do you mean kind of like how authentication is sometimes handled on input/UI constrained devices (e.g. TVs), where a message could be played to callers, asking them to enter a one-time code at a particular website?

On the face of it, this could work quite well for most people.

The hard part about e2e is that it only gets you anything if none of the parties involved are being surveilled after decryption. This is easily within the capabilities of any adversary worth worrying about.
e2e means you don't have to worry about transparent 3rd parties keeping secrets for you. It prevents casual abuses of your privacy and sets the bar for violating it (deserved or not).