by budde
2680 days ago
Yeah, this is really, really far from an apples-to-apples comparison. First off, the test dataset is trivially small for the use cases where big data systems are typically applied. I don't know why you'd introduce all the complexity and overhead of a distributed MapReduce framework to ETL a dataset that would fit in memory on consumer-grade hardware. It's not exactly fair to compare a framework running on a single node to one where you've artificially introduced multiple nodes and network overhead for a dataset that would easily fit on one.

You'll also notice a pretty stark difference between the level of detail provided for the BlazingSQL test setup and the Spark one, which (unless I'm missing something) lacks any code or configuration details. I've dipped my toes in the big data space long enough and seen enough "${FANCY NEW FRAMEWORK} beats ${INDUSTRY-STANDARD FRAMEWORK} by 123x!!" posts to recognize this as a gigantic red flag. How you manage partition sizes, the order and choice of operations, and tuning parameters can make orders-of-magnitude differences in performance.

Maybe the future of frameworks like this will be on the GPU. I'm just not seeing any evidence of it yet. Right now, Spark fills the space where you can throw globs of memory at TB- to PB-scale problems. I could very well be wrong, but I don't see how this is going to be cost-effective on GPUs given the current cost of memory there.
2. The dataset is trivially small because this is a new engine built for the RAPIDS ecosystem, and for the time being it is limited to a single node. We are releasing our distributed version for GTC (mid-March) and will be able to give you more reasonable comparisons then. This is a similar development path to our pre-RAPIDS engine, which went from single node to distributed in about a month because we built the engine to be distributed. Right now we are finishing up UCX integration, which is the layer we will use to communicate between all the nodes.
3. You can always try it out. It's on Docker Hub (see links in this post), and if you want to run distributed workloads right now you can manage that process using Dask by handling the splitting up of the job yourself. In a few weeks the job will be split up for you automatically, without you needing to know the size of the cluster or how to distribute data across it.
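
To give a rough idea of what "handling the splitting up of the job yourself" looks like, here is a minimal sketch of the split / per-chunk work / merge pattern. It is not BlazingSQL or Dask code: it uses Python's standard-library `ThreadPoolExecutor` as a stand-in scheduler, and the helper names (`split_rows`, `partial_sum_by_key`, `merge_totals`, `run_split_job`) are hypothetical, made up for illustration. With Dask you would presumably submit the per-chunk function to the cluster instead of a local thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(rows, n_parts):
    """Split rows into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(rows), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks each take one additional row.
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def partial_sum_by_key(chunk):
    """Per-chunk work: a GROUP BY key / SUM(value) over one chunk."""
    totals = {}
    for key, value in chunk:
        totals[key] = totals.get(key, 0) + value
    return totals


def merge_totals(partials):
    """Combine the per-chunk results into one final result."""
    merged = {}
    for part in partials:
        for key, value in part.items():
            merged[key] = merged.get(key, 0) + value
    return merged


def run_split_job(rows, n_workers):
    """Manually split the job, run each chunk on a worker, then merge."""
    chunks = split_rows(rows, n_workers)
    # Stand-in scheduler (assumption): a local thread pool, not a cluster.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_by_key, chunks))
    return merge_totals(partials)


if __name__ == "__main__":
    rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
    print(run_split_job(rows, n_workers=2))
```

The point is that today you pick the number of chunks and decide where each one runs; the automatic version described above would make those choices for you.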