by ianbutler 1989 days ago
In my experience, Glue is both more of a pain in the butt than regular old Spark with PySpark and way more expensive. I would seriously question anyone suggesting it.

We could have been using it wrong, but porting our Glue scripts to standard EMR after our initial POC cut our costs by more than 10x, and it was substantially faster.


Both pricing and start-up times are significantly better in Glue 2.0 (assuming one can migrate). But even on Glue 1.0, orchestrating an ETL process with several dozen jobs is such a non-trivial amount of configuration and labor (job failures, job restarts, paging, job run history, CloudWatch logs, reusable infrastructure as code when creating new jobs, permissions and security, etc.) that the increased cost is more than worth it for us.

https://aws.amazon.com/blogs/aws/aws-glue-version-2-0-featur...

We're crawling and processing TBs of web data. We just use some Python workers, Airflow, and SQS, and trigger a few scheduled EMR jobs; easy peasy. Restarts and whatnot are handled by Kubernetes at the container level and by Airflow at the code level. Airflow bakes in permissions and job management. Glue left a lot to be desired in that area, and $400-600 per ingest can't beat $30 for the time the EMR cluster is up. Since we use Kube for everything already, it wasn't much of a hassle to continue using it here. I'm sure in your case it makes sense, and in ours it didn't, and this is why technology is crazy :P
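
For a rough sense of the gap, here is a minimal sketch of the arithmetic behind those numbers. The dollar figures are the ones quoted above; the function and variable names are hypothetical, and it ignores everything besides the per-ingest bill (engineering time, Airflow and Kubernetes overhead, and so on).

```python
def cost_ratio(glue_cost: float, emr_cost: float) -> float:
    """Return how many times more one ingest cost on Glue than on EMR."""
    if emr_cost <= 0:
        raise ValueError("emr_cost must be positive")
    return glue_cost / emr_cost


# Figures quoted in the comment: $400-600 per ingest on Glue, about $30 on EMR.
GLUE_LOW, GLUE_HIGH, EMR = 400.0, 600.0, 30.0

low = cost_ratio(GLUE_LOW, EMR)
high = cost_ratio(GLUE_HIGH, EMR)
print(f"Glue was roughly {low:.1f}x to {high:.1f}x the EMR cost per ingest")
```

That range lines up with the "more than 10x" savings mentioned at the top of the thread.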