And then add PySpark on top of that. Couldn't leave my last job fast enough after they decided to use Hadoop/PySpark, even though the largest incoming files we received were a few GB at most.
I once had a consulting gig where the customer desperately wanted to build a Spark/Scala ML pipeline for a dataset that was 10 MB. We spent 3 months hammering it together, when a flat Python process would've taken us 2 weeks.
> This find xargs mawk pipeline gets us down to a runtime of about 12 seconds, or about 270MB/sec, which is around 235 times faster than the Hadoop implementation.
I built it in a container for work and didn't find it that difficult, to be honest. And Google has plenty of example Dockerfiles that show the steps needed.
The only real system dependencies are Java 8, Maven and TeX Live (and Python/R if you build for that). Then it's `make-distribution.sh` with the appropriate flags. Scala and everything else that is needed is downloaded by Maven. The resulting directory is self-contained, assuming you have a Java 8 runtime on your target machine.
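For anyone who wants to try it, here is a rough sketch of that build step. It assumes you are in the root of a Spark source checkout and that the script lives under `dev/` (it does in recent Spark versions; older ones kept it at the top level). The distribution name `custom-spark` and the `-Pyarn -Phive` profiles are only illustrative choices, so check the "Building Spark" docs for the version you are actually building.

```shell
# Sketch only: the name and profiles below are assumptions, not required values.
# --tgz packages the result as a tarball; add --pip or --r only if you need
# the Python or R packages (those need Python/R installed on the build host).
SPARK_SRC="${SPARK_SRC:-.}"
BUILD_CMD="./dev/make-distribution.sh --name custom-spark --tgz -Pyarn -Phive"

if [ -x "$SPARK_SRC/dev/make-distribution.sh" ]; then
  # BUILD_CMD is left unquoted on purpose so the shell splits it into arguments.
  (cd "$SPARK_SRC" && $BUILD_CMD)
else
  echo "No Spark checkout at '$SPARK_SRC'; would run: $BUILD_CMD"
fi
```

Point `SPARK_SRC` at your checkout to run the build; anywhere else it just prints the command it would have run.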
Sales talk and buzzwords. Either the author has no idea that Hadoop is an ecosystem and that Spark depends on it, or they deliberately mix up Hadoop and Kubernetes, which aren't much related.
Even if you don't run HDFS and YARN, you aren't escaping Hadoop. And if some configuration goes wrong, you'll probably need to look into the Hadoop conf files.
The original comment was about the mass of libraries that Hadoop brings in. Spark isn't a solution that allows you to leave the mess. If you try to dockerize Spark, you'll still end up with 300 MB images full of JARs that came from wherever.
I've tried building it for my day job. I would rather have a colonoscopy without sedation.
It's more pleasant and dignified.