by iamsomewalrus
2884 days ago
The AWS Glue service provides a serverless Spark environment for running jobs. Here's a link: https://docs.aws.amazon.com/glue/latest/dg/author-job.html

* The default timeout is ~48 hours, and you pay per Data Processing Unit (DPU) that you've provisioned for the job.
* Currently it supports Python and Scala. As far as I'm aware you can't run Java jobs directly, but you can upload JAR libraries and use them in your code.

Re serverless vs dedicated / ephemeral clusters:
Like with any serverless runtime environment, you are trading flexibility for convenience (across a few dimensions). The Glue environment offers only a few runtimes and uses a specific version of Spark that you can't upgrade yourself. In return, it's pretty quick to author a job: you set the required DPUs, Glue handles the provisioning, and you don't have to worry about sizing a cluster for your data. For me, most of my jobs fit within those constraints. At some point on the cost curve it may make sense to move all of your jobs from Glue onto a dedicated cluster on EMR. You may also get there sooner if you need specific frameworks or libraries.
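
To make the cost-curve point a bit more concrete, here's a rough back-of-the-envelope sketch in Python. It assumes billing is simply DPUs x hours x a per-DPU-hour rate, and it ignores minimum billing increments and any cluster overhead. The function names and every number in the example (the rate, the cluster cost, the job size) are placeholders I made up for illustration, so plug in the current pricing for your region before drawing conclusions.

```python
import math


def glue_job_cost(dpus: int, runtime_hours: float, price_per_dpu_hour: float) -> float:
    """Cost of one job run, assuming you pay for every provisioned DPU for the whole run."""
    return dpus * runtime_hours * price_per_dpu_hour


def breakeven_runs(cluster_monthly_cost: float, dpus: int,
                   runtime_hours: float, price_per_dpu_hour: float) -> int:
    """Smallest number of runs per month at which Glue costs at least as much
    as a dedicated cluster with a fixed monthly cost."""
    per_run = glue_job_cost(dpus, runtime_hours, price_per_dpu_hour)
    return math.ceil(cluster_monthly_cost / per_run)


if __name__ == "__main__":
    # All values below are hypothetical placeholders, not quoted prices.
    rate = 0.44             # assumed price per DPU-hour
    cluster_cost = 1000.0   # assumed monthly cost of a dedicated EMR cluster

    per_run = glue_job_cost(dpus=10, runtime_hours=0.5, price_per_dpu_hour=rate)
    print(f"one 30-minute run on 10 DPUs: {per_run:.2f}")

    # Worst case if a job hangs until the ~48 hour default timeout.
    runaway = glue_job_cost(dpus=10, runtime_hours=48, price_per_dpu_hour=rate)
    print(f"same job running until the timeout: {runaway:.2f}")

    runs = breakeven_runs(cluster_cost, dpus=10, runtime_hours=0.5, price_per_dpu_hour=rate)
    print(f"runs per month before the cluster is cheaper: {runs}")
```

The runaway case is also a decent argument for setting a timeout well below the default on jobs you expect to finish quickly.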