And then add PySpark on top of that. Couldn't leave my last job fast enough when they decided to use Hadoop/PySpark when the largest incoming files we received were at most a few GBs.
I once had a consulting gig where the customer desperately wanted to build a Spark/Scala ML pipeline, for a dataset that was 10 MB. We spent 3 months hammering it together for a flat Python process that would've taken us 2 weeks.
> This find xargs mawk pipeline gets us down to a runtime of about 12 seconds, or about 270MB/sec, which is around 235 times faster than the Hadoop implementation.