by luizfelberti 218 days ago
Honestly this benchmark feels completely dominated by the instance's NIC capacity.

They used a c5.4xlarge that has peak 10Gbps bandwidth, which at a constant 100% saturation would take in the ballpark of 9 minutes to load those 650GB from S3, making those 9 minutes your best case scenario for pulling the data (without even considering writing it back!)
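A quick way to check that figure, as a minimal sketch. It assumes decimal units (1 GB = 8 gigabits) and a link held at exactly 100% utilization with no protocol overhead; the function name is made up for illustration.

```python
def transfer_seconds(size_gb: float, link_gbps: float) -> float:
    """Seconds needed to move size_gb gigabytes over a link_gbps link at full saturation."""
    size_gigabits = size_gb * 8  # bytes -> bits
    return size_gigabits / link_gbps


seconds = transfer_seconds(650, 10)
print(f"{seconds:.0f} s = {seconds / 60:.1f} min")  # 520 s = 8.7 min
```

Any real run will be slower than this floor, since it ignores request latency, TLS overhead and the write-back.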

Minute differences in how these query engines schedule IO would have drastic effects in the benchmark outcomes, and I doubt the query engine itself was constantly fed during this workload, especially when evaluating DuckDB and Polars.

The irony of workloads like this is that it might be cheaper to pay for a gigantic instance to run the query and finish it quicker, than to pay for a cheaper instance taking several times longer.
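To make the cost point concrete, here is a rough sketch. The hourly prices and runtimes below are hypothetical placeholders, not real quotes, and it assumes billing is proportional to runtime.

```python
def query_cost(hourly_price_usd: float, runtime_minutes: float) -> float:
    """Cost of one run, assuming you pay only for the minutes the instance is up."""
    return hourly_price_usd * runtime_minutes / 60


# Hypothetical numbers: the big instance costs 4x per hour but finishes 6x faster.
small = query_cost(hourly_price_usd=1.0, runtime_minutes=60)
large = query_cost(hourly_price_usd=4.0, runtime_minutes=10)

print(f"small: ${small:.2f}, large: ${large:.2f}")  # small: $1.00, large: $0.67
```

The bigger instance wins whenever its speedup exceeds its price ratio, which is plausible for a workload bounded by NIC capacity.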


It would be amusing to run this on a regular desktop computer or even a moderately nice laptop (with a fan - give it a chance!) and see how it does. 650GB will stream in quite quickly from any decent NVMe device, and those 8-16 cores might well be considerably faster than whatever cores the cloud machines are giving you.

S3 is an amazingly engineered product, operates at truly impressive scale, is quite reasonably priced if you think of it as warm-to-very-cold storage with excellent durability properties, and has performance that barely holds a candle to any decent modern local storage device.

Absolutely. I recently reworked a bunch of tests and found my desktop to outcompete our (larger, custom) GitHub Actions runner by roughly 5x. And I expect this delta to increase a lot as you lean on the local I/O harder.

It really is shocking how much you're paying given how little you get. I certainly don't want to run a data center and handle all the scaling and complexity of such an endeavour. But wow, the tax you pay to have someone manage all that is staggering.

Everyone wants a data lake when what they have is a data pond.
I think you meant puddle.

cue Peppa Pig laughter sounds

Totally true. I have a trusty old (like 2016 era) X99 setup that I use for 1.2TB of time series data hosted in a timescaledb PostGIS database. I can fetch all the data I need quickly to crunch on another local machine, and max out my aging network gear to experiment with different model training scenarios. It cost me ~$500 to build the machine, and it stays off when I'm not using it.

Much easier obviously dealing with a dataset that doesn't change, but doing the same in the cloud would just be throwing money away.

> They used a c5.4xlarge that has peak 10Gbps bandwidth, which at a constant 100% saturation would take in the ballpark of 9 minutes to load those 650GB from S3, making those 9 minutes your best case scenario for pulling the data (without even considering writing it back!)

The query being tested wouldn't scan the full files. In most sane engines it would process much less than 650GB of data by exploiting S3 byte-range reads: just one column, a timestamp, which is also correlated with the partition keys. Nowadays what I would mostly worry about is the distribution of file sizes (due to API calls and skew), or a query that departs from the common access patterns and defeats the metadata/columnar nature of the underlying Parquet (i.e. doing an effective "full scan" over all row groups and/or columns).
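For anyone unfamiliar with byte-range reads, a minimal sketch of the idea: once an engine knows from the Parquet footer where a column chunk lives, it can fetch only that span with HTTP `Range` headers. The offset and length below are made up for illustration, and the helper name is hypothetical; only the `bytes=start-end` syntax (inclusive end) comes from the HTTP standard.

```python
def range_headers(offset: int, length: int, chunk_size: int) -> list[str]:
    """Split the span [offset, offset + length) into HTTP Range header values."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    headers = []
    end = offset + length
    start = offset
    while start < end:
        stop = min(start + chunk_size, end) - 1  # Range end is inclusive
        headers.append(f"bytes={start}-{stop}")
        start = stop + 1
    return headers


# Hypothetical column chunk: 2500 bytes starting at offset 1000, read in 1000-byte pieces.
print(range_headers(1000, 2500, 1000))
# ['bytes=1000-1999', 'bytes=2000-2999', 'bytes=3000-3499']
```

Each header becomes one GET request, which is why many small files or many small ranges turn into an API-call problem rather than a bandwidth problem.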

> The irony of workloads like this is that it might be cheaper to pay for a gigantic instance to run the query and finish it quicker, than to pay for a cheaper instance taking several times longer.

That's absolutely right.

Yep I think the value of the experiment is not clear.

You want to use Spark for a large dataset with multiple stages. In this case, their I/O bandwidth is 1GB/s from S3. CPU memory bandwidth is 100-200GB/s for a multi-stage job. Spark is a way to pool memory for a large dataset with multiple stages, and use cluster-internal network bandwidth to do shuffling instead of storage.

Maybe when you have S3 as your backend, the storage bandwidth bottleneck doesn't show up in perf, but it sure does show up in the bill. A crude rule of thumb: network bandwidth is 20X storage, main memory bandwidth is 20X network bandwidth, accelerator/GPU memory is 10X CPU. It's great that single-node DuckDB/Polars are that good, but this is like racing a taxiing aircraft against motorbikes.
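Spelling that rule of thumb out, as a sketch only: the multipliers are the crude ones quoted above, chained so that each tier is expressed relative to storage bandwidth.

```python
from itertools import accumulate
from operator import mul

# (tier, multiplier over the previous tier), taken from the rule of thumb above.
tiers = [
    ("storage", 1),
    ("network", 20),
    ("main memory", 20),
    ("accelerator memory", 10),
]

names = [name for name, _ in tiers]
cumulative = list(accumulate((step for _, step in tiers), mul))
relative_to_storage = dict(zip(names, cumulative))

print(relative_to_storage)
# {'storage': 1, 'network': 20, 'main memory': 400, 'accelerator memory': 4000}
```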

Network bandwidth is not 20x storage any more. An SSD is around 10GB/s now, so similar to 100Gb Ethernet.
I think I'm talking about cluster-scale network bisection bandwidth vs attached storage bandwidth. With replication/erasure coding overhead and the economics, the order of magnitude difference still prevails.

I think your point is a good one in that it is more economics than systems physics. We size clusters to have more compute/network than storage because it is the design point that maximizes overall utility.

I think it also raises an interesting question: suppose we get to a point where the disparity really no longer holds. That would justify a complete rethinking of many Spark-like applications that are designed to exploit this asymmetry.

And that's for one SSD. If you're running on a server rather than a laptop, aggregate storage bandwidth will almost certainly be higher than any single network link.
The appropriate comparison point for aggregate cluster storage bandwidth would be its bisection bandwidth.

(I do HPC, IIRC ANL Aurora is < 1PB/s DAOS and 20 PB/s bisection).

10Gbps only? At Google where this type of processing would automatically be distributed, machines had 400Gbps NICs, not to mention other innovations like better TCP congestion control algorithms. No wonder people are tired of distributed computing.
You can get a 600Gbps interface on an Amazon EC2 instance (c8gn.48xlarge), if you’re willing to pay for it.
"At Google" is doing all the heavy lifting in your comment here, with all due respect. There is but one Google but remain millions of us who are not "At Google".
I’m merely describing the infrastructure that at least partially led to the success of distributed data processing. Also 400Gbps NIC isn’t a Google exclusive. Other clouds and on-premise DCs could buy them from Broadcom or other vendors.
The infra might have a 400Gbps NIC, but if you're buying a small compute slice on that infra, you don't get all the capability.
They do at AWS too, but OP paid for a small VM.
This is a really good observation, and matches something I had to learn painfully over 30 years ago. At a Wall Street bank, we were trying to really push the limits with some middleware, and my mentor at the time very quietly suggested "before you test your system's performance, understand the theoretical maximum of your setup first with no work".

The gist was - find your resource limits and saturate them and see what the best possible performance could be, then measure your system, and you can express it as a percentage of optimal. Or if you can't directly test/saturate your limits at least be aware of them.
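A sketch of what "percentage of optimal" could look like for a transfer-bound job. The measured runtime here is a hypothetical number for illustration, and the floor ignores everything except raw link capacity.

```python
def percent_of_optimal(measured_seconds: float, bytes_moved: float, link_gbps: float) -> float:
    """How close a run came to the time the link alone would allow."""
    floor_seconds = bytes_moved * 8 / (link_gbps * 1e9)  # best case at 100% saturation
    return 100 * floor_seconds / measured_seconds


# Hypothetical run: 650 GB pulled over a 10 Gbps NIC in 26 minutes.
pct = percent_of_optimal(measured_seconds=26 * 60, bytes_moved=650e9, link_gbps=10)
print(f"{pct:.0f}% of the theoretical best")  # 33% of the theoretical best
```

A low percentage tells you the engine was not kept fed, which is a different finding from "the engine is slow".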

I'm kind of surprised they didn't choose an EC2 instance with higher throughput. S3 can totally eke out 100s of Gibps with the right setup.

BUT the author did say this is the simple stupid naive take, in which case DuckDB and Polars really shined.

c5 is such a bad instance type; m6a would be much better and even cheaper. I would love to see this on an m8a.2xlarge (7th and 8th generations don’t use SMT), which is cheaper still and has up to 15 Gbps.
Actually for this kind of workload 15Gbps is still mediocre. What you actually want is the `n` variant of the instance types, which have higher NIC capacity.

In the c6n and m6n and maybe the upper-end 5th gens you can get 100Gbps NICs, and if you look at the 8th gen instances like the c8gn family, you can even get instances with 600Gbps of bandwidth.

The math here is weird.

A Samsung 990 Pro reads at something like 50 Gbps and PCIe 4.0 x4 is quite a bit faster than that. You can get this speed with a queue depth that isn’t crazy, and you can have multiple NVMe operations in flight reading the same large Parquet file. Latency is in the tens of microseconds.

The consensus seems to be that S3 can read one object at somewhat under 1Gbps. You can probably scale that to the full speed of your NIC by reading multiple objects at once, but you may not be able to scale by reading one object in multiple overlapping ranges. Latency is in the milliseconds.

So, sure, a fast EC2 instance with massive multi-object parallelism can have 10x higher bandwidth than an NVMe device, but the amount of parallelism and latency tolerance needed is a couple orders of magnitude higher than for NVMe. Meanwhile that NVMe device does not charge for read operations and costs a couple hundred dollars, once.
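To put rough numbers on the parallelism gap, a back-of-the-envelope sketch. It treats 1 Gbps per object as an upper bound and picks 50 microseconds and 10 milliseconds as stand-ins for "tens of microseconds" and "milliseconds"; those two latencies are my assumptions, not measurements.

```python
import math


def streams_needed(target_gbps: float, per_stream_gbps: float) -> int:
    """Concurrent object reads required to reach target_gbps."""
    return math.ceil(target_gbps / per_stream_gbps)


def bytes_in_flight(bandwidth_gbps: float, latency_seconds: float) -> float:
    """Bandwidth-delay product: data that must be outstanding to keep the pipe full."""
    return bandwidth_gbps * 1e9 / 8 * latency_seconds


print(streams_needed(10, 1.0))   # streams to fill the 10 Gbps NIC
print(streams_needed(600, 1.0))  # streams to fill a 600 Gbps NIC

nvme = bytes_in_flight(50, 50e-6)  # ~50 Gbps NVMe, assumed 50 us latency
s3 = bytes_in_flight(500, 10e-3)   # 10x that bandwidth, assumed 10 ms latency
print(f"S3 needs roughly {s3 / nvme:.0f}x more data in flight")
```

The exact ratio depends entirely on the assumed latencies, but it lands in the thousands rather than the tens, which is the point being made above.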

If you are so inclined, you can build an NVMe-oF setup (at much, much higher cost) that separates compute and storage and has excellent performance, but this is a nontrivial undertaking.