Hacker News
by SmirkingRevenge 2981 days ago
If your Spark jobs are mostly batch workloads that can tolerate moderately infrequent failures and restarts, try using Google Dataproc with preemptible VMs or Amazon EMR with spot instances.

Depending on your use case, you might spend many times less than you would with regular (on-demand) VMs. Many instances that cost several dollars an hour on AWS can be had for a fraction of that price as spot instances.

It's also fairly easy to automate the region selection and bid (on AWS, that is; I'm not sure about gcloud).
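
For what it's worth, here is a rough sketch of what automating the region and bid choice could look like in Python. This is not anyone's production setup: the function names, the median-based ranking, and the 20% headroom over the recent maximum are all assumptions made for illustration. The only AWS call used is boto3's `describe_spot_price_history`, and only the first page of results is read.

```python
from statistics import median


def pick_region_and_bid(price_history, headroom=1.2):
    """Pick the region with the lowest median spot price and suggest a bid.

    price_history maps region name -> list of recent spot prices (USD/hour).
    The bid is the recent maximum in the chosen region times `headroom`
    (an arbitrary safety margin, assumed here, not an AWS recommendation).
    """
    # Ignore regions that returned no price data.
    medians = {r: median(p) for r, p in price_history.items() if p}
    if not medians:
        raise ValueError("no spot price data for any region")
    region = min(medians, key=medians.get)
    bid = round(max(price_history[region]) * headroom, 4)
    return region, bid


def fetch_price_history(regions, instance_type, hours=24):
    """Fetch recent Linux spot prices per region (hypothetical helper).

    Needs boto3 and AWS credentials; reads only the first page of results.
    """
    import boto3  # imported lazily so the pure logic above works without it
    from datetime import datetime, timedelta, timezone

    start = datetime.now(timezone.utc) - timedelta(hours=hours)
    history = {}
    for region in regions:
        ec2 = boto3.client("ec2", region_name=region)
        resp = ec2.describe_spot_price_history(
            InstanceTypes=[instance_type],
            ProductDescriptions=["Linux/UNIX"],
            StartTime=start,
        )
        # SpotPrice comes back as a string, so convert it to float.
        history[region] = [float(e["SpotPrice"]) for e in resp["SpotPriceHistory"]]
    return history
```

Usage would be something like `pick_region_and_bid(fetch_price_history(["us-east-1", "us-west-2"], "r4.2xlarge"))`, where the region list and instance type are just placeholders. Ranking by median rather than the latest price is a design choice to avoid chasing a momentary dip.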

If you need streaming, obviously this might not be the way to go.

1 comments

I find the EMR markup to be substantial; if I weren't working in a corporation, I would stand up my own Spark clusters, e.g. using spark-ec2.
It is, but you can run the bare minimum number of core nodes (3 I think?) and use spot instances for any others.
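
To put rough numbers on that trade-off, here is a back-of-the-envelope sketch in Python. All prices in the usage note are hypothetical placeholders, not real AWS rates, and the function name is made up. It assumes the EMR fee is a flat per-instance-hour surcharge added on top of the EC2 price for every node, spot or not.

```python
def hourly_cluster_cost(core_nodes, task_nodes, on_demand_price,
                        spot_price, emr_fee=0.0):
    """Estimate hourly cost of a cluster with on-demand core nodes
    and spot task nodes.

    All prices are USD per instance-hour. emr_fee is the per-instance
    EMR surcharge (assumed to apply to every node); leave it at 0.0
    to model a self-managed standalone cluster.
    """
    core_cost = core_nodes * (on_demand_price + emr_fee)
    task_cost = task_nodes * (spot_price + emr_fee)
    return core_cost + task_cost
```

For example, with made-up prices of $1.00 on-demand, $0.25 spot, and a $0.10 EMR fee, compare `hourly_cluster_cost(3, 10, 1.00, 0.25, 0.10)` against the all-on-demand `hourly_cluster_cost(13, 0, 1.00, 0.25, 0.10)`. Setting `emr_fee=0.0` shows what the markup itself costs.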

At a previous job, we just built our own EC2 image that ran Spark in standalone mode for ephemeral Spark clusters, and it was wonderful and cheap. The clusters also launched very fast compared to EMR.