Hacker News new | ask | show | jobs
by quackerhacker 2844 days ago
So I have an affection for Discord...so I appreciate posts like this.

>"Using the WebRTC native library allows us to use a lower level API from WebRTC (webrtc::Call) to create both send stream and receive stream."

So I'm gathering that discord's voice servers receive multiple persistent connections, then compress the audio streams for delivery to each end user. THIS part is where I can't imagine the on-the-fly cpu usage. Each client's receiving compression needs to also negate their own audio to prevent an echo effect (no point to hear your own voice), but it also means separate compression streams per user.

>" All the voice channels within a guild are assigned to the same Discord Voice server."

I imagine this helps significantly with I/O in converting live streams into 1 stream per end user. I've dealt with video compression (only in ffmpeg) and live syncing time stampings, and I can say from experience that, this is no easy feature. I understand this is audio streams (so lower overhead), but still the persistent voice server needs to handle the incoming connections, web socket heartbeats (negligible), compression (high I/O), and deliver the streams (high memory usage too).

I'm impressed, but would love to hear the specs on the media servers and their DL/UL speeds. My old setup to deliver live video (in sync and compressed) was 6 mini-itx's, 4GB of ram per board, and i3's...my bottleneck was my isp, which I solved with multiple docsis modems and an internal switch (each board had 2 ethernet ports).

1 comments

We don't transcode audio/video on the server. Each stream is processed and muxed by the client. The server is merely relaying rtp packets and tagging them with a given ssrc per peer. The client does the rest of the work.

The bulk of the user-space time on the SFU is spent doing encryption (xalsa/dtls). We also avoid memory allocations in the hot paths, using fixed-size ring buffers as much as possible.

Additionally, we coalesce sends using sendmmsg, to reduce syscalls in the write path: (http://man7.org/linux/man-pages/man2/sendmmsg.2.html)

I posted some about the specs here: https://news.ycombinator.com/item?id=17954163

If Discord is basically proxying the raw packets from one client to the others, isn't that wasted bandwidth (for discord, not the clients). I understand from the post that the goal would be to mask the ip of the users, to shoulder user privacy and the ddos vector. Kudos on silence detection to save overhead.

So video w/audio broadcasting has to be compressed client side, then proxied through Discord's media servers, to the end user's. That's pretty smart...I just wished that I could send my raw stream to a LAN host so I could offload the compression, and allow my LAN host to provide delivery (I'm a nitro user).

Would rather waste bandwidth than CPU cycles in this case. Would take way too much CPU time to mux audio streams together server-side, and then recompress. (Means we have to buffer data for each sender, deal with silence, deal with retransmits and packet drops, have a jitter buffer, etc...). No way we'd be able to hit the # of clients we want per core with that overhead. Our SFU's are intentionally very dumb for this reason.

Also, muxing server side means we can't do things like per-peer volume and muting, without having to individually mux and re-encode for each user in the channel depending on who they have muted and the volumes they have set per peer (which would explode CPU complexity even further).

So, in this case, bandwidth is cheap, let's use (and waste) some, in an effort to simplify the SFU, and also, make it more CPU efficient. Default audio stream is 64kbps (or 8 KB/sec), per speaking user.

More bandwidth less cpu and complexity