by tgtweak 2231 days ago
We usually run a tiny EC2 instance with Airflow on it to spin up spot-market instances right-sized to the job, then map them to EMR templates to start the Spark cluster and submit the job. This is the most cost-effective approach I've seen. It is limited to batch work, and you need to set an upper bound for the spot bid plus bid-failure logic (fall back to on-demand instances, or wait until the next run attempt), but in practice it has seldom failed to secure the instances: only a handful of times over the last 3 years.
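
The bid-failure logic described above can be sketched roughly like this. It is only an illustration under assumptions, not the actual Airflow DAG: `choose_market`, the `Fallback` enum, and the price inputs are hypothetical names made up for the example.

```python
from enum import Enum


class Fallback(Enum):
    ON_DEMAND = "on_demand"  # pay full price so this run still happens
    WAIT = "wait"            # skip now and retry on the next scheduled run


def choose_market(spot_price: float, max_bid: float, fallback: Fallback) -> str:
    """Decide where to launch: 'spot', 'on_demand', or 'skip' (hypothetical helper)."""
    if spot_price <= max_bid:
        # The current spot price is within the upper bound we are willing to pay.
        return "spot"
    if fallback is Fallback.ON_DEMAND:
        return "on_demand"
    return "skip"
```

In a scheduler this decision would sit in front of the cluster-creation step, with `max_bid` set per job.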

To give you an idea, we run an 8x m4.4xlarge job every hour and it costs less than $800/mo, including S3 and egress of the output data. On-demand pricing to keep that cluster up persistently would be about $4900/mo.
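
For a sense of scale, the two monthly figures quoted above work out as follows. The only inputs are the numbers from the comment; since the spot figure is an upper bound ("less than $800"), the savings are a lower bound.

```python
spot_monthly = 800.0        # upper bound on the spot-based setup, USD per month
on_demand_monthly = 4900.0  # approximate always-on on-demand cost, USD per month

# Fraction of the on-demand bill that is avoided (at least this much).
savings = 1 - spot_monthly / on_demand_monthly

# How many times more expensive the persistent on-demand cluster is.
ratio = on_demand_monthly / spot_monthly

print(f"savings: at least {savings:.0%}; on-demand costs about {ratio:.1f}x as much")
```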

So, to OP: great platform, but your real value contribution for large users (the ones with budget) would be any cost optimization features you could build in.

PS: the k8s spark-submit feature is amazingly easy and highly recommended for beginners. Set up k8s using Rancher and spark-submit your way to data devops bliss.
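
As a rough sketch of what a spark-submit against Kubernetes looks like, the helper below just assembles the command line. The helper name and every argument value (API server URL, image, jar path, class) are placeholders assumed for illustration; the flags themselves are the standard `spark-submit` options for the `k8s://` master.

```python
def k8s_submit_args(api_server: str, image: str, app_resource: str,
                    main_class: str, executors: int = 2) -> list[str]:
    """Build a spark-submit command for cluster mode on Kubernetes (hypothetical helper)."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",  # Kubernetes API server URL
        "--deploy-mode", "cluster",         # the driver runs in a pod too
        "--name", "example-job",
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_resource,                        # e.g. a local:// path inside the image
    ]
```

The resulting list can be handed to `subprocess.run(args, check=True)` on a machine that has a Spark distribution and cluster credentials.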

1 comment

Seconded, I’m doing this as well with Airflow and EMR. Instance fleets make the fallback logic to on-demand instances super easy (you set the price + time allowed for trying spot and then the on-demand instances you want to fall back to).
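
A minimal sketch of such an instance fleet definition, in the shape EMR's RunJobFlow request expects under `Instances.InstanceFleets`. The helper name and the concrete values are assumptions for illustration, and the field names are as I recall them from the EMR API, so check them against the current documentation before relying on this.

```python
def core_fleet(instance_type: str, units: int,
               bid_pct_of_on_demand: float, spot_timeout_min: int) -> dict:
    """Core fleet that tries spot first, then switches to on-demand (hypothetical helper)."""
    return {
        "Name": "core",
        "InstanceFleetType": "CORE",
        "TargetSpotCapacity": units,  # try to fill all capacity from spot
        "InstanceTypeConfigs": [
            {
                "InstanceType": instance_type,
                "WeightedCapacity": 1,
                # The "price": maximum spot price as a percentage of on-demand.
                "BidPriceAsPercentageOfOnDemandPrice": bid_pct_of_on_demand,
            }
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                # The "time allowed for trying spot".
                "TimeoutDurationMinutes": spot_timeout_min,
                # What to do when the timeout expires without spot capacity.
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    }
```

With boto3 this dict would go into the `InstanceFleets` list passed to the EMR client's `run_job_flow` call, next to a master fleet defined the same way.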