Hacker News new | ask | show | jobs
by itg 2384 days ago
And then add PySpark on top of that. Couldn't leave my last job fast enough when they decided to use Hadoop/PySpark when the largest incoming files we received were at most a few GBs.
1 comments

I once had a consulting gig where the customer desperately wanted to build a Spark/Scala ML pipeline, for a dataset that was 10 MB. We spent 3 months hammering it together for a flat Python process that would've taken us 2 weeks.
> This find xargs mawk pipeline gets us down to a runtime of about 12 seconds, or about 270MB/sec, which is around 235 times faster than the Hadoop implementation.

https://adamdrake.com/command-line-tools-can-be-235x-faster-...

If you'd sent it off to mechanical Turk it would have been done in an afternoon.