| HN Mirror


	by itg 2384 days ago
	And then add PySpark on top of that. Couldn't leave my last job fast enough once they decided to use Hadoop/PySpark, even though the largest incoming files we received were a few GB at most.

1 comment

boxy310 2384 days ago

I once had a consulting gig where the customer desperately wanted to build a Spark/Scala ML pipeline for a dataset that was 10 MB. We spent 3 months hammering it together, when a flat Python process would've taken us 2 weeks.

snaky 2384 days ago

> This find xargs mawk pipeline gets us down to a runtime of about 12 seconds, or about 270MB/sec, which is around 235 times faster than the Hadoop implementation.

https://adamdrake.com/command-line-tools-can-be-235x-faster-...

buzzkillington 2384 days ago

If you'd sent it off to Mechanical Turk it would have been done in an afternoon.
