by danking00 849 days ago
It'd be interesting to see a peak-sequential-bandwidth by cost-per-gigabyte plot. The number I keep in my head is 500 MiB/s, but you're right that there are much faster drives out there [1]. Of the public clouds: Google's "Local SSD" claims ~12,000 MiB/s but they're ephemeral and you need 12 TiB of disks to hit that bandwidth [2][4]. AWS has these io2 SSDs which claim 4,000 MiB/s [3].

On the other points of the article, even if you had a huge disk array plugged into the machine, how many cores can you also plug into that computer? I suppose there will always be a (healthy, productive) race here between the vertical scaling of GPUs + NVMe SSDs and the horizontal scaling of CPUs and blob storage.

EDIT: formatting.

[1] First Google result is Tom's hardware: https://www.tomshardware.com/features/ssd-benchmarks-hierarc...

[2] https://cloud.google.com/compute/docs/disks/local-ssd#nvme_l...

[3] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/provisio...

[4] The ephemerality has two downsides. First, you have to get the data onto that local SSD from some other, probably slower, storage system (I haven't benchmarked GCS lately, but that's probably your best bet for quickly downloading a bunch of data?). Second, you need to use non-spot instances which are 3-6x the price.


With AMD Epycs having 128 PCIe 4.0 lanes, using say 96 for disks, that's 192 GB/s of aggregate bandwidth (at roughly 2 GB/s per lane). With 16 TB U.2 SSDs that's up to 1.5 PB of storage if you use one lane per disk.

Not for your average homelab budget but...
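For anyone who wants to check the figures above, here is a minimal sketch of the arithmetic. The 2 GB/s per lane figure is an assumption (a rounded per-direction number for PCIe 4.0 that ignores protocol overhead), and the function names are made up for illustration.

```python
# Assumption: ~2 GB/s of usable bandwidth per PCIe 4.0 lane, per direction
# (16 GT/s with 128b/130b encoding is about 1.97 GB/s before protocol overhead).
GB_PER_S_PER_LANE = 2.0
LANES_FOR_DISKS = 96     # from the comment: 96 of the 128 lanes go to storage
DRIVE_CAPACITY_TB = 16   # from the comment: 16 TB U.2 drives, one lane each


def aggregate_bandwidth_gb_s(lanes, per_lane=GB_PER_S_PER_LANE):
    """Upper bound on combined sequential bandwidth across all lanes."""
    return lanes * per_lane


def total_capacity_pb(drives, capacity_tb=DRIVE_CAPACITY_TB):
    """Total raw capacity in decimal petabytes, one drive per lane."""
    return drives * capacity_tb / 1000


print(f"{aggregate_bandwidth_gb_s(LANES_FOR_DISKS):.0f} GB/s")  # 192 GB/s
print(f"{total_capacity_pb(LANES_FOR_DISKS):.2f} PB")           # 1.54 PB
```

Real drives and controllers will land below this ceiling, so treat it as a best case.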

And "characteristic time": disk size / speed. In other words, the time required to read or write the full disk.

I'm doing some consulting with a client with a few terabytes in SQL Server. He keeps talking about challenges in reprocessing data in a migration to ClickHouse.

I find it interesting that the solution to a lot of these problems is to just reprocess the data and not try to optimize anything. 10 TB is not a lot of data with NVMe.
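To put rough numbers on that, here is a small sketch that applies the size / speed ratio to a 10 TB dataset at the sequential bandwidths quoted elsewhere in the thread (500 MiB/s, 4,000 MiB/s, ~12,000 MiB/s). It assumes a single sequential pass at the claimed peak rate and 10 TB meaning 10 * 10^12 bytes. Real reprocessing also spends time on CPU and writes, so these are lower bounds.

```python
MIB = 2**20  # bytes per MiB


def full_pass_seconds(size_bytes, bandwidth_mib_s):
    """Time for one sequential pass over the data at a given bandwidth."""
    return size_bytes / (bandwidth_mib_s * MIB)


DATASET_BYTES = 10 * 10**12  # assumption: 10 TB in decimal units

# Bandwidth figures are the vendor claims quoted in the thread.
tiers = [
    ("500 MiB/s rule of thumb", 500),
    ("AWS io2 claim", 4_000),
    ("GCP Local SSD claim", 12_000),
]

for label, bandwidth in tiers:
    minutes = full_pass_seconds(DATASET_BYTES, bandwidth) / 60
    print(f"{label}: {minutes:.0f} min")
```

That works out to a bit over five hours at 500 MiB/s, about 40 minutes at 4,000 MiB/s, and under 15 minutes at 12,000 MiB/s.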

I gathered a table for Google's and Amazon's options. I do not have experience with on-prem solutions so I don't know how to compare these prices to the cost of owning and operating hardware. I'm sure it's cheaper over time for the hardware but I imagine you need sufficient scale to amortize the personnel costs.

Storage

    | product              | price (USD/GiB-month) | price (USD/IOPS-month)            | claimed max read bandwidth (MiB/s) | minimum price to achieve bandwidth |
    |----------------------|-----------------------|-----------------------------------|------------------------------------|------------------------------------|
    | GCP NVMe Local SSD   | 0.1046 [1]            | 0.00                              | 12,480 [2]                         | 13,000 USD/month [3]               |
    | AWS io2 SSD          | 0.1250 [4]            | 0.065 [4,5]                       | 4,000 [6]                          | 1,042 USD/month [4,6,14]           |
    | AWS io1 SSD          | 0.1250 [4]            | 0.065 [4,5]                       | 500 [7]                            | 135 USD/month [4,7,15]             |
    | Google Cloud Storage | 0.0200 [8]            | 0.0004 per "1k Class B Op" [8,16] | 23,842 [9]                         | [10]                               |
    | AWS S3               | 0.0230 [11]           | 0.0004 per 1k GET [8,16]          | 11,921? [12]                       | [13]                               |

You could build a similar table for compute but it gets complicated. FLOP seems like a reasonable unit of compute, but there are things other than FLOPs (e.g. decoding your column-oriented compression scheme).

I've tried to do this comparison a few times but I usually find it hard to get clear aggregate FLOP numbers for GPUs. GPUs also require caretaker CPUs and I don't have experience using them so I'm not certain how to spec a VM that can practically saturate the compute of the GPUs. My gut instinct is that the big compute consumers must be able to arbitrage this to some extent by shifting some workloads to chase the cheapest FLOP.

EDIT(2x): Table formatting. We could really use some Markdown styling on HN.

EDIT3: Clarify incomparability of IOPS and GETs.

[1] https://cloud.google.com/compute/disks-image-pricing#localss...

[2] https://cloud.google.com/compute/docs/disks/local-ssd#nvme_l...

[3] For 12TiB. https://cloud.google.com/compute/docs/disks/local-ssd#nvme_l...

[4] https://aws.amazon.com/ebs/pricing/

[5] We only need 16,000 IOPS for peak performance of io2, so I ignore the drop in price at higher volumes.

[6] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/provisio...

[7] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/provisio...

[8] https://cloud.google.com/storage/pricing#price-tables

[9] https://cloud.google.com/storage/quotas#bandwidth

[10] Honestly not sure. You can do 5,000 parallel "reads" per second to a single object. I'm not sure what kind of instance you need to receive 23,842 MiB/s or if a single object can actually deliver that much bandwidth. https://cloud.google.com/storage/quotas#objects

[11] It gets slightly cheaper as volume goes up. https://aws.amazon.com/s3/pricing/

[12] I could not find a clear answer, but it seems at least 100 Gbps (11,921 MiB/s) https://repost.aws/knowledge-center/s3-maximum-transfer-spee...

[13] As with Google [10], I'm not really sure.

[14] For 16GiB and 16k IOPS.

[15] For 40GiB and 2000 IOPS.

[16] Not really comparable to provisioned IOPS because you pay for the IOPS once per month whereas you pay for every individual GET request.
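
The derived numbers in the table and footnotes can be reproduced with a few lines of arithmetic. This is a sketch using only the prices and sizes listed above; the function names are made up for illustration, and "Gbps" is assumed to mean 10^9 bits per second.

```python
MIB = 2**20  # bytes per MiB


def gbps_to_mib_s(gbps):
    """Convert decimal gigabits per second to binary mebibytes per second."""
    return gbps * 1e9 / 8 / MIB


def provisioned_monthly_usd(size_gib, iops, usd_per_gib=0.125, usd_per_iops=0.065):
    """Monthly cost of a provisioned-IOPS volume at the io1/io2 prices in the table."""
    return size_gib * usd_per_gib + iops * usd_per_iops


# Footnote [12]: 100 Gbps expressed in MiB/s (the GCS row is twice this).
print(f"{gbps_to_mib_s(100):,.0f} MiB/s")  # 11,921 MiB/s
print(f"{gbps_to_mib_s(200):,.0f} MiB/s")  # 23,842 MiB/s

# Footnote [14]: io2 at 16 GiB and 16k IOPS.
print(f"{provisioned_monthly_usd(16, 16_000):,.0f} USD/month")  # 1,042 USD/month

# Footnote [15]: io1 at 40 GiB and 2000 IOPS.
print(f"{provisioned_monthly_usd(40, 2_000):,.0f} USD/month")   # 135 USD/month

# Footnote [3]: 12 TiB of Local SSD at the listed per-GiB price, storage only.
local_ssd_usd = 12 * 1024 * 0.1046
print(f"{local_ssd_usd:,.0f} USD/month")  # 1,285 USD/month
```

The io2 and io1 rows match the table. The same storage-only arithmetic for 12 TiB of Local SSD gives roughly 1,285 USD/month rather than the 13,000 listed, and it is not clear from the thread what else that figure includes.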