by nrzuk 4342 days ago
Could someone explain to me (it's been a long day, and I'm no storage expert!) how these are useful? Wouldn't you need a whole bunch of these at each location where they are installed?

Just basing this on my own experience: I have 12 drives in RAID with fairly substantial sequential throughput. Start multiple streams of high-bandwidth video and the maximum throughput from the drives drops sharply due to the random reads.

I would have assumed that having even 100 people streaming from a single box would be an interesting challenge.

5 comments

These things can pump out 12Gbit during peak viewing, which is enough for about 4,000 avg. 3 Mbit streams.

If they can cram in 256GB of DRAM, that's enough room for a 64MB buffer per stream, or about 170 seconds worth of streaming. Now you only need to be able to fill those buffers at a rate of about 24/second.

I'm assuming whatever file system is on the disks uses a massive block size, so the number of seeks you'd have to perform to pull 64MB off is probably pretty low. Eight? Sixteen? Even if it's the latter, that's only 384 seeks/second, which you could very plausibly do striped over only a half dozen disks, and the device presumably has many more.
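The napkin math above can be replayed in a few lines. This is only a sketch using the figures quoted in the comment (12 Gbit/s, 3 Mbit/s streams, a hypothetical 256 GB of DRAM, and the pessimistic guess of sixteen seeks per refill); the variable names are made up for illustration and decimal units are assumed throughout.

```python
# Inputs are the commenter's guesses, not measured values.
LINK_GBIT = 12        # peak output per box, Gbit/s
STREAM_MBIT = 3       # average stream bitrate, Mbit/s
DRAM_GB = 256         # hypothetical DRAM size
SEEKS_PER_FILL = 16   # pessimistic guess at seeks per buffer refill

# Number of concurrent streams the link can carry.
streams = LINK_GBIT * 1000 // STREAM_MBIT

# DRAM split evenly across streams, in MB (decimal units).
buffer_mb = DRAM_GB * 1000 / streams

# How long one full buffer lasts at the stream bitrate.
buffer_seconds = buffer_mb * 8 / STREAM_MBIT

# Each buffer must be refilled once per buffer_seconds.
fills_per_second = streams / buffer_seconds
seeks_per_second = fills_per_second * SEEKS_PER_FILL

print(f"streams:          {streams}")
print(f"buffer per stream: {buffer_mb:.0f} MB")
print(f"buffer duration:   {buffer_seconds:.1f} s")
print(f"refills/second:    {fills_per_second:.1f}")
print(f"seeks/second:      {seeks_per_second:.0f}")
```

Without rounding the refill rate up to 24, the seek budget comes out at 375/second rather than 384, which does not change the conclusion.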

> I'm assuming whatever file system is on the disks...

Netflix uses UFS+J on FreeBSD 10.

Here are some notes from their talk at NYCBSDCon 2014:

- 400,000 stream files per appliance.

- 5,000 - 25,000 client streams per appliance.

- 300 - 500 streams coming off each disk all the time.

- Attempt to buffer 1MB ahead, but caching is futile.

- Result is completely random disk workload.

- System becomes limited by disk latency and CPU load.

Video here: https://www.youtube.com/watch?v=FL5U4wr86L4

Makes a little more sense now, thanks :)

I'm no expert either, but if my napkin math is right, fully utilized in a best-case scenario (no network overhead, etc.), each of those disks would peak at around 30-40 MByte/s (36 disks, 10 Gbit NIC).
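For reference, a minimal sketch of that napkin math, assuming the NIC is the only bottleneck and the load is spread evenly over all 36 disks (both are simplifications, and the names are illustrative):

```python
NIC_GBIT = 10   # NIC line rate from the comment, Gbit/s
DISKS = 36      # disk count from the comment

# Convert Gbit/s to MByte/s (decimal units), then split across disks.
nic_mbyte_per_s = NIC_GBIT * 1000 / 8
per_disk_mbyte_per_s = nic_mbyte_per_s / DISKS

print(f"NIC ceiling: {nic_mbyte_per_s:.0f} MByte/s")
print(f"Per disk:    {per_disk_mbyte_per_s:.1f} MByte/s")
```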

I'm not sure if the I/O is really all that random. Theoretically, it's all sequential, but might effectively hit the disk like random I/O because of concurrent streaming sessions, varying network speed and so on. My guess is you can get pretty close to sequential I/O with some simple means like using large block sizes in your RAID, well-tuned readahead and Native Command Queuing.

The new versions are all SSD.

There's a bit of magic to it, tuning the OS to work well in conjunction with the hardware you have, but it's possible.

* Use 15k RPM SAS 12Gb/s drives instead of cheap consumer drives.

* Use 50 of them in a chassis instead of 12.

* Use RAID 0 instead of RAID 5 or 6 - I'll bet the quoted storage space is raw, not post-RAID.

* Have multiple copies of a show on disk, as the article states happens.

* Optimize for a read-heavy workload during peak hours (e.g., mount noatime).

The devil is in the details, but what I listed above is probably a good starting point. The article states a data rate of 3GB/hour, which is only 0.83 megabytes/second per stream; 100 streams is only 83 megabytes/second, which is easy for the above configuration. Hell, I'll bet the above configuration could do 10,000 streams if the data rate is 3GB/hour with no issues given Netflix's peak read-only workload.
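The conversion is easy to check. A small sketch, taking the article's 3GB/hour figure at face value and assuming decimal units (the helper name is hypothetical):

```python
GB_PER_HOUR = 3  # per-stream data rate quoted from the article

def aggregate_mb_per_s(stream_count, gb_per_hour=GB_PER_HOUR):
    """Total read rate in MB/s needed to feed stream_count streams."""
    per_stream = gb_per_hour * 1000 / 3600  # MB/s for one stream
    return per_stream * stream_count

for n in (1, 100, 10_000):
    mb = aggregate_mb_per_s(n)
    # Also show the equivalent network rate in Gbit/s.
    print(f"{n:>6} streams: {mb:8.2f} MB/s ({mb * 8 / 1000:.2f} Gbit/s)")
```

At 10,000 streams this works out to roughly 8.3 GB/s, or about 67 Gbit/s on the wire, so at that scale the network side matters as much as the disks.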

A 15000RPM SAS drive would be an _extremely_ poor choice for this type of workload. Power hungry, hot, extremely sensitive to vibration, very low data density, and, of course, ridiculously expensive.

What? While I can't argue with them being ridiculously expensive (because they very much are) why would them being power hungry, hot, and extremely sensitive to vibration even matter if they are going to be rack-mounted into an ISP's data center, with redundant power-supplies and a data-center grade cooling system to match?

The only possible issue of the ones you raised is low data density, but all engineering is a trade-off: 15k RPM drives get you better seek times, which would generally lead to the ability to support more viewers in a single box. Not working for Netflix, I don't know how many users they want to support per box.

It's totally possible that between the number of streams they want to support, and the total storage in the box that they've gone with SATA drives, but to pretend that 15k RPM SAS drives are "an _extremely_ poor choice" for an enterprise-grade storage system would be to ignore the fact that Netflix is making an enterprise-grade storage appliance - and that on the top-end, those appliances commonly use SAS drives.

I will double-down to claim that 15krpm SAS drives are a bad choice for any application, and they are only used as a bandaid for marginal improvements on irredeemable system designs.

To address your points individually:

1) Power hungry. When you add in conversion and distribution and cooling, every watt consumed by the computer is consumed again by the datacenter infrastructure. Power costs money.

2) Hot is just the corollary of 1). Hot is also the enemy of density, and this box is very dense.

3) Sensitive to vibration. If you aren't intimately familiar with this fact then you aren't getting the performance you paid for from your 15k disks. To achieve their spectacular claimed seek times they need very careful mechanical design of their enclosure. Much more careful than racking up one of Netflix's boxes in a rack with other random vibrators.

4) Density. To get the space Netflix is using here, you need 5x as many (and more expensive) 15k disks, because they top out at 600GB while the disks people actually use for these workloads hold 3TB.

A smart read-ahead strategy obviates the need for shiny seek time specs. For any given stream you could read ahead by 32MB or whatever. Now you've made seek time irrelevant. Put lots of RAM in the machine and you're done at a tiny fraction of the capital and operating costs of 15k disks.
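To illustrate why a large read-ahead makes seek time matter less, here is a rough model where every chunk costs one seek plus a sequential transfer. The 10 ms seek and 150 MB/s sequential rate are assumptions picked for illustration, not figures from Netflix or any drive datasheet.

```python
def effective_mb_per_s(chunk_mb, seek_ms=10.0, sequential_mb_per_s=150.0):
    """Throughput when each chunk read pays one seek, then streams sequentially.

    seek_ms and sequential_mb_per_s are assumed values for a generic
    7200 RPM SATA disk; substitute real numbers for a specific drive.
    """
    seek_s = seek_ms / 1000.0
    transfer_s = chunk_mb / sequential_mb_per_s
    return chunk_mb / (seek_s + transfer_s)

# Compare a small 64 KB read with progressively larger read-ahead sizes.
for chunk_mb in (0.0625, 1, 2, 32):
    rate = effective_mb_per_s(chunk_mb)
    print(f"{chunk_mb:>7} MB per seek -> {rate:6.1f} MB/s effective")
```

Under these assumptions a 64 KB read gets about 6 MB/s out of the disk, while a 32 MB read-ahead gets within a few percent of the sequential rate, which is the sense in which seek time stops mattering.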

+1, in general. To add a bit of content:

1) Power hungry: the total power envelope of a rack position is the limiting issue. You can fit 16-20 disks per RU no problem. The constraint on total density is your 5 kVA or 10 kVA power budget. Watts matter.

3) sensitive to vibration, be very very quiet http://www.youtube.com/watch?v=tDacjrSCeq4

4) Density: actually 4TB is hitting the sweet spot for $ per byte last time I looked. If you need absolute density look for 5 & 6TB Real Soon Now. If you can tolerate some loss, variable density disks around 5TB look quite a bit more cost effective.

5) Read ahead, you only need to read ahead by a couple of chunks. For video ball park it at 2MB. You don't really need lots of RAM, think 32-64GB per chassis. Additionally each 8GB dimm costs about the same power as a disk. By going with 32GB instead of 64 I can fit another 2-3 disks in the power budget per chassis.
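A quick sanity check on the RAM figure, as a sketch only: it combines the 2MB read-ahead from this comment with the 5,000 - 25,000 client streams per appliance quoted from the NYCBSDCon notes elsewhere in the thread, and assumes every stream holds exactly one read-ahead buffer with no other memory use.

```python
def readahead_ram_gb(streams, readahead_mb=2):
    """RAM in GB (decimal) needed to hold one read-ahead buffer per stream."""
    return streams * readahead_mb / 1000

# Stream counts taken from the talk notes quoted earlier in the thread.
for streams in (5_000, 25_000):
    print(f"{streams:>6} streams x 2 MB = {readahead_ram_gb(streams):.0f} GB")
```

That gives 10 GB at the low end and 50 GB at the high end, which is in the same range as the 32-64GB per chassis suggested above.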

They are actually using standard SATA drives, see spec here, https://www.netflix.com/openconnect/hardware
"The following system was developed and first deployed at the end of 2011."
Newer revisions use 4 or 6 TB drives, and 15K SAS drives of that size do not exist from what I can find.

See http://oc.nflxvideo.net/docs/OpenConnect-Deployment-Guide.pd...

> Start multiple streams of high bandwidth videos and the maximum throughput from the drives drops sharply due to the random reads.

I think this can be fixed by prefetching harder (like 1-2 MB).

100 Netflix streams would be less than 100 MB/s which can be satisfied with a single SATA drive. And some of the Netflix boxes have dozens of SSDs that each do 500 MB/s.

Well, if it means that they need to install n boxes in an ISP to avoid peering at a high cost then it might be worth it.

> Sometimes it'll even have multiple copies of the same show, if something is in crazy high demand.

That statement from the article leads one to believe that the RAID arrays are in parallel.

These boxes are meant to reduce the number of packets going all the way across the internet, not necessarily remove the need.

Also, in addition to lots of storage, these boxes probably have loads of RAM that they can buffer popular streams into and then serve them from.