Hacker News new | ask | show | jobs
by Bedon292 3411 days ago
I know its just a benchmark for comparison, and it is awesome. I love seeing cool comparisons like this, but why do I care that this particular benchmark is faster than Spark? What sort of analytics will be affected by this improvement, and will it actually be saving me time on real world use cases?
3 comments

Our impression was that when Databricks released the billion-rows-in-one-second-on-a-laptop benchmark, readers were pretty awed by that result. We wanted to show that when you combine an in-memory database with Spark so it shares the same JVM/block manager, you can squeeze even more performance out of Spark workloads (over and above Spark 's internal columnar storage). Any analytics that require multiple trips to a database will be impacted by this design. E.g. workloads on a Spark + Cassandra analytics cluster will be significantly slower, barring some fundamental changes to Cassandra.
Good questions. One answer - Speed in analytics when working with large data sets matters. A lot. Think about this - several vendors seem to be claiming support for interactive analytics. i.e. i can ask an adhoc question on large volume of data and get some sort of insight in "interactive" times. Really? maybe with a thousand cores? In a competitive industry, say like in investment banking, if one can discern a pattern before the competition it simply provides an edge. Ask the question to folks involved in detecting fraud, or online portals trying to place an appropriate ad, or manufacturing plant trying to anticipate/predict faults. It isn't so much about trying to prove we are better than Spark (well other than grabbing some attention :-) ) but rather the potential to live in a world where batch processing is a thing of the past - like working with mainframes. The hope is that we can gain insight instantly. fwiw, we love Spark and expect some of these optimizations simply become part of Apache Spark.
it's literally impossible to test every possible workload for a tool like Spark so... what's the point of asking? Stand up a testing cluster, run some of your jobs and you'll get the answer.
I am not asking about every workload. I was just curious about an example workload where this benchmark matters.