by daper 1405 days ago
I have some experience serving static content and working with CDNs. Here is what I find interesting / unique here:

- They are not using OS page cache or any memory caching for that, every request is served directly from disks. This seems possible only when requests are spread between many NVMe disks, since a single high-end NVMe drive like the Micron 9300 PRO has a max read speed of 3.5GB/s (or 28Gbps) - far less than 800Gbps. Looks like it works ok for long-tail content, but what about new hot content everybody wants to watch on the day of release? Do they spread the same content over multiple disks for this purpose?

- Async I/O resolves issues with the nginx process stalling because of a disk read operation, but only after you've already opened the file. Depending on the FS / number of files / other FS activity / directory structure, opening the file can block for a significant time, and there is no async open() AFAIK. How do they resolve that? Are we assuming the i-node cache contains all i-nodes and open() time is insignificant? Or are they configuring nginx with a large open file cache?

- TLS for streamed media was necessary because browsers started to complain about non-TLS content. But that makes things sooo complicated, as we see in the presentation (kTLS is 50% of CPU usage before moving to encryption offloaded by the NIC). One has to remember that the content is most probably already encrypted (DRM), so we just add another layer of encryption / authentication. TLS for media segments makes so little sense IMO.

- When you rely on encryption or TCP offloading by the NIC, you are stuck with what is possible with your NIC. I guess no HTTP/3 over UDP or fancy congestion control optimization in TCP until the vendor somehow implements it in the hardware.
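
The disk-count arithmetic in the first point can be checked with a quick back-of-the-envelope sketch. The 3.5GB/s and 800Gbps figures are the ones quoted above; the function names are made up for illustration, and the calculation assumes decimal units, perfectly even load across disks, and no protocol overhead.

```python
import math

def gbytes_to_gbits(gbytes_per_s: float) -> float:
    """Convert GB/s to Gbps (8 bits per byte, decimal units)."""
    return gbytes_per_s * 8

def min_disks(target_gbps: float, disk_gbytes_per_s: float) -> int:
    """Smallest number of disks whose combined read rate reaches the target.

    Assumes reads are spread perfectly evenly, which real traffic is not.
    """
    return math.ceil(target_gbps / gbytes_to_gbits(disk_gbytes_per_s))

per_disk_gbps = gbytes_to_gbits(3.5)   # the 28Gbps quoted above
disks_needed = min_disks(800, 3.5)     # lower bound for 800Gbps
print(per_disk_gbps, disks_needed)
```

Under those assumptions you need at least 29 such drives for 800Gbps, so any single hot title pinned to one disk would be capped far below the NIC rate.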


Responding to a few points. We do indeed use the OS page cache. The hottest files remain in cache and are not served from disk. We manage what is cached in the page cache and what is directly released using the SF_NOCACHE flag.

I believe our TLS initiative was started before browsers started to complain, and was done to protect our customers' privacy.

We have lots of fancy congestion optimizations in TCP. We offload TLS to the NIC, *NOT* TCP.

Can I ask if your whole content can be stored on a single server, so content is simply replicated everywhere, or is there some layer above that which directs requests to the specific group of servers storing the requested content? I assume the described machine is not just part of a tiered cache setup, since I don't think nginx is capable of complex caching scenarios.
No, the entire catalog cannot fit on a single server.

There is a Netflix Tech Blog from a few years ago that talks about this better than I could: https://netflixtechblog.com/content-popularity-for-open-conn...

> We offload TLS to the NIC, NOT TCP.

How is this possible? If TCP is done on the host and TLS on the NIC, the data will need to pass through the CPU, right? But the slides show the CPU fully bypassed for data.

The CPU gets the i/o completion for the read, and is in charge of the ram address where it was stored, but it doesn't need to read that data...

Modern NICs use packet descriptors that allow you to more or less say "take N bytes from this address, then M bytes from some other address", etc. to form the packet. So the kernel is going to make the TCP/IP header, and then tell the NIC to send that with the next bytes of data (and mark it for TLS however that's done).
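
A loose user-space analogy for that descriptor idea is gather I/O: you hand the kernel a list of separate buffers and it sends them as one stream without you concatenating them first. The sketch below uses Python's `socket.sendmsg` on a local socket pair. It is only an illustration of "header from here, payload from there"; it says nothing about how the actual NIC descriptors or the TLS marking work, and the helper names are made up.

```python
import socket

def send_gathered(sock: socket.socket, header: bytes, payload: memoryview) -> int:
    """Send header and payload from two separate buffers in a single call."""
    # sendmsg takes a list of buffers; nothing is joined in user space.
    return sock.sendmsg([header, payload])

def recv_all(sock: socket.socket) -> bytes:
    """Read until the peer closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            return b"".join(chunks)
        chunks.append(chunk)

left, right = socket.socketpair()
header = b"HDR:"                                  # stands in for the TCP/IP header
file_data = bytearray(b"media segment bytes")     # stands in for data read from disk
sent = send_gathered(left, header, memoryview(file_data))
left.close()
received = recv_all(right)
right.close()
print(sent, received)
```

The sender never touches the payload bytes beyond passing a reference to them, which is the same shape as the CPU knowing the RAM address but not reading the data.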

A Micron 9300 Pro is getting rather long in the tooth. They are using PCIe gen 4 drives that are twice as fast as the Micron 9300.

My own testing on single socket systems that look rather similar to the ones they are using suggests it is much easier to push many 100 Gbit interfaces to their maximum throughput without caching. If your working set fits in cache, that may be different. If you have a legit need for sixteen 14 TiB (15.36 TB) drives, you won't be able to fit that amount of RAM into the system. (Edit: I saw a response saying they do use the cache for the most popular content. They seem to explicitly choose what goes into cache, not allowing a bunch of random stuff to keep knocking the most important content out of cache. That makes perfect sense and is consistent with my point: you cannot just hope that a half TiB cache will do the right thing with 224 TiB of content.)
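
For anyone who wants to check the capacity numbers, here is a small sketch of the arithmetic. The drive count, the 15.36 TB drive size, and the half TiB cache come from the paragraph above; the helper name is made up, and the cache is treated as a flat fraction of total storage, which ignores how skewed real popularity is.

```python
TB = 10**12    # decimal terabyte, as printed on drive labels
TIB = 2**40    # binary tebibyte

def tb_to_tib(tb: float) -> float:
    """Convert decimal terabytes to binary tebibytes."""
    return tb * TB / TIB

drive_tib = tb_to_tib(15.36)        # just under 14 TiB per drive
total_tib = 16 * drive_tib          # close to the 224 TiB quoted above
cache_fraction = 0.5 / total_tib    # share of content a half TiB cache can hold

print(f"{drive_tib:.2f} TiB per drive")
print(f"{total_tib:.1f} TiB total")
print(f"{cache_fraction:.2%} of content fits in cache")
```

The cache covers roughly 0.2% of what is on disk, which is why it only helps if you are deliberate about what goes into it.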

TLS is probably also to keep the cable company from snooping on the Netflix traffic, which would allow the cable company to more effectively market rival products and services. If there's a vulnerability in the decoders of encrypted media formats, putting the content in TLS prevents a MITM from exploiting that.

From the slides, you will see that they started working with Mellanox on this in 2016 and got the first capable hardware in 2020, with iterations since then. Maybe they see value in the engineering relationship to get the HW acceleration that they value into the hardware components they buy.

Disclaimer: I work for NVIDIA who bought Mellanox a while back. I have no inside knowledge of the NVIDIA/Netflix relationship.

Just from reading the specs (i.e. real world details might derail all of this):

https://www.freebsd.org/cgi/man.cgi?query=sendfile&sektion=2

Given one can specify arbitrary offsets for sendfile(), it's not clear to me that there must be any kind of O(k > 1) relationship between open() and sendfile() calls: As long as you can map requested content to a sub-interval of a file, you can co-mingle the catalogue into an arbitrarily small number of files, or potentially even stream directly off raw block devices.
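
A minimal sketch of that idea, assuming a Linux-like system where Python's `os.sendfile(out_fd, in_fd, offset, count)` is available: several items are packed into one file, the file is opened once, and each request is served as an (offset, length) sub-interval. The file name, the index layout, and the helper names are all hypothetical, and nothing here is claimed to be how the servers in the talk are actually set up.

```python
import os
import socket
import tempfile

def pack(blobs: dict, path: str) -> dict:
    """Concatenate blobs into one file; return name -> (offset, length)."""
    index = {}
    offset = 0
    with open(path, "wb") as f:
        for name, data in blobs.items():
            f.write(data)
            index[name] = (offset, len(data))
            offset += len(data)
    return index

def send_range(sock: socket.socket, fd: int, offset: int, length: int) -> int:
    """Send `length` bytes starting at `offset` from an already-open fd."""
    sent = 0
    while sent < length:
        n = os.sendfile(sock.fileno(), fd, offset + sent, length - sent)
        if n == 0:          # hit end of file early
            break
        sent += n
    return sent

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "catalogue.bin")
    index = pack({"a": b"AAAA", "b": b"BBBBBB"}, path)
    fd = os.open(path, os.O_RDONLY)      # a single open() for everything
    left, right = socket.socketpair()
    off, length = index["b"]
    sent = send_range(left, fd, off, length)
    left.close()
    got = b""
    while True:                          # read until the sender's close
        chunk = right.recv(64)
        if not chunk:
            break
        got += chunk
    right.close()
    os.close(fd)

print(index, got)
```

The cost moves from open() calls to maintaining the index, so whether this is a win depends on how often the catalogue changes.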

Does the encryption in DRM protect the metadata?
AFAIK no. The point of DRM is to prevent recording / playing the media on a device without the decryption key (authorization). So the goal is different from TLS, which is used by the client to ensure the content is authentic, unaltered during transmission and not readable by a man-in-the-middle.

But do we really need such protection for a TV show?

"Metadata" in HLS / DASH is a separate HTTP request which can be served over HTTPS if you wish. Then it can refer to media segments served over HTTP (unless your browser / client doesn't like "mixed content").

> But do we really need such protection for a TV show?

DRM may be mandated by the content owners. TLS gives Netflix customers privacy against their ISP snooping what they're watching.

> But do we really need such protection for a TV show?

What you watch can be a very private thing, especially for famous people.

No, and it doesn't protect the privacy of the viewer either!
FWIW, neither does the TLS layer: because the video is all chunked into fixed-time-length segments, each video causes a unique signature of variable-byte-size segments, making it possible to determine which Netflix movie someone is watching based simply on their (encrypted) traffic pattern. Someone built this for YouTube a while back and managed to get it up to like 98% accuracy.

https://www.blackhat.com/docs/eu-16/materials/eu-16-Dubin-I-...

https://americansforbgu.org/hackers-can-see-what-youtube-vid...
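
To make the traffic-pattern idea concrete, here is a toy sketch. It is not the method from the linked material, just the simplest matcher that shows why segment sizes leak the title: slide the observed sizes along each title's known size sequence and count how many land within a small tolerance. The catalog values, the tolerance, and the fixed 29-byte overhead are all invented for the example.

```python
def best_match(observed: list, catalog: dict, tolerance: int = 64):
    """Return (title, hits) for the catalog window that best matches observed sizes."""
    best_title, best_hits = None, 0
    for title, sizes in catalog.items():
        # Slide a window of len(observed) along this title's segment sizes.
        for start in range(len(sizes) - len(observed) + 1):
            window = sizes[start:start + len(observed)]
            hits = sum(abs(o - s) <= tolerance for o, s in zip(observed, window))
            if hits > best_hits:
                best_title, best_hits = title, hits
    return best_title, best_hits

# Hypothetical per-title segment sizes in bytes (made-up numbers).
catalog = {
    "title_a": [510_000, 620_500, 480_250, 700_100, 655_000],
    "title_b": [430_000, 890_300, 512_700, 610_050, 770_900],
}

# What an eavesdropper might see: three consecutive segments of title_b,
# each inflated by a small, assumed constant encryption overhead.
observed = [size + 29 for size in catalog["title_b"][1:4]]

match = best_match(observed, catalog)
print(match)
```

Encryption hides the bytes but not the sizes, so a handful of consecutive segments is enough to single out a title in this toy catalog.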

Did TLS 1.3 fix this with content length hiding? Doesn't it add support for variable-length padding that could prevent the attacker from measuring the plaintext content length? Do any major servers support it?