IMO, Spark isn't the way forward. The typical pattern is that it lets you scale up to 100 cores really easily, which is almost enough to compete with a good single-threaded implementation in a fast language.
The workflows I deal with generally involve moving hundreds of terabytes of data from storage into memory, processing it, and writing it back out. Single machines (even beefy ones) tend to hit their limits (networking, max RAM, cache size, TLB, etc.).
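To put rough numbers on the networking limit, here's a back-of-envelope sketch. The 200 TB dataset, the 100 Gbit/s NIC, and the 50-node cluster are illustrative assumptions of mine, not measurements, and the math assumes perfect line rate with no storage or protocol overhead.

```python
def transfer_hours(total_bytes: float, link_bits_per_sec: float) -> float:
    """Hours needed to move total_bytes over one link at full line rate."""
    bytes_per_sec = link_bits_per_sec / 8
    return total_bytes / bytes_per_sec / 3600


TB = 10**12                  # decimal terabyte
DATA_BYTES = 200 * TB        # assumed dataset size
NIC_BITS_PER_SEC = 100e9     # assumed 100 Gbit/s NIC per machine
NODES = 50                   # assumed cluster size

# One machine has to pull the whole dataset through a single NIC.
single_machine = transfer_hours(DATA_BYTES, NIC_BITS_PER_SEC)

# Each node in the cluster only pulls its own 1/NODES share, in parallel.
per_node = transfer_hours(DATA_BYTES / NODES, NIC_BITS_PER_SEC)

print(f"single machine: {single_machine:.2f} h just to read the input")
print(f"{NODES} nodes: {per_node * 60:.1f} min per node, in parallel")
```

That's only the read side; writing the results back out costs roughly the same again, and real throughput will be lower than line rate.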
Maybe there's a better tool than Spark, I don't know. The important thing is that Spark is the most ubiquitous.