|
Author here - appreciate the comments and reads. To add a bit of color -- earlier this year I spent about a month looking into orchestrators to migrate Whatnot's data platform onto, and it was a miserable experience. We were on AWS Managed Airflow, but to stay on it and have a solid platform, I would have been writing GitHub Actions for CI/CD, standing up ECR and IAM roles with Terraform, setting up EKS to run Kubernetes jobs, managing infra monitoring with Datadog, etc. In fact, I did end up doing all those things, but we opted for Dagster Cloud because of their focus on improving developer efficiency. Their team provided pre-built GitHub Actions for CI/CD and recently introduced PR-specific branch deployments, which has been amazing. They're moving towards serverless execution, built-in ECR repositories, and managed secrets. I expect Prefect and Astronomer are moving in this direction, too, but I liked the Dagster project's energy quite a bit. As I've waded into the MLOps world as well, it just keeps looking like every platform basically devolves into the same thing: an orchestrator that provisions compute resources and logs metadata into an opinionated data model. Catalog tools like Atlan are metadata sinks that are trying to build out orchestration/workflow capabilities. dbt Cloud, of course, is just an orchestrator for a specific type of data product, and it is aiming to operationalize metadata with its metrics layer. Orchestration + a metadata data model is the common denominator here. I think the fact that Airflow is so inevitable has made it really hard for people to imagine the category as anything other than a scheduler, but perhaps some of these new companies can break new ground. |
One Q - it seems to me that another possible solve (and probably how the big guys tend to do it) is to use a dataflow engine like Spark/Flink. Did you compare a managed platform like Google Dataproc? They also have a serverless option if you don’t want a heavy managed cluster, which might make this approach more viable for non-huge companies that wouldn’t fully utilize a min-spec cluster. (When I last evaluated this, they didn’t have serverless, which was a dealbreaker at my small scale.)