Seeing as this article is from 2019, would people still recommend Airflow for ETL from APIs into data warehouses/data lakes, or is there something better on the market?
Nothing is more popular yet, but there are better-architected options out there. It's hit 17k GitHub stars and was used by the team I was previously at.
I don't think anything will beat it unless something from the CI/CD or "cloud native" world moves in unexpectedly.
The operators and scalability are somewhat useful. I was happy with the UI compared to cron. Testing is a mess. Also, Airflow isn't CI/CD-friendly (but it's possible to get it to work).
I'd recommend a managed option unless you have a skilled ops team. It reminds me of Hadoop in terms of how exciting it is to get set up, which isn't a good thing.
I can confirm all of this. I was involved with setting up Airflow recently and we had a rather rough time because it is kind of a half-assed solution. It's basically a framework that lets you do stuff through a lot of plugins/connectors that may or may not be useful to you, and they vary widely in completeness, bugginess, documentation, and utility. A lot of it is kind of sketchy or even actively harmful, but there are definitely some useful things as well.
It does not help that the entirety of the documentation is written from the point of view of people who are definitely not of the devops variety, doing things manually on their laptops, i.e. all the wrong things you should never do in a production setup. Configuring this thing for production usage is largely undocumented and nontrivial, and you'll be piecing things together from Stack Overflow and various third-party GitHub repositories (e.g. for using Docker, Terraform, etc.) rather than from the official documentation, which merely hints at these things being possibilities.
It also does not help that the internals are kind of buggy and wonky. We had a really hard time getting the basic plumbing for running workers, queues, etc. working properly. It would constantly grind to a halt and stop processing stuff. Also, there's this minutes-long uncertainty principle: "is it actually running or still figuring out that it needs to catch up?!".
Also, the UI/UX is terrible IMHO. Think hitting cmd+r a lot, because automatic page refreshes are not a thing in Airflow, and absolutely everything requires multiple clicks to navigate complex dialogs (modal, naturally). So, unless you have just manually reloaded the page, you are looking at stale information: jobs that have long finished, green statuses that have turned red, etc. Even Jenkins/Hudson had auto-reload 15 years ago. And given the significant overlap in functionality, you might actually be better off using that if all you need is the ability to run some simple job at specific intervals.
The only valid reason for using Airflow is the ecosystem of plugins. It's valid, and it's basically the same reason that people tolerated the craptastic experience that was managing Nagios back in the day. Horribly complicated to set up, terrible/primitive UI, loads of performance issues, nontrivial failure modes, etc., but world + dog used it and there were Nagios plugins for just about everything. I've been down that rabbit hole as well and I'd say the experience is similar enough.
So, definitely use it in hosted form if you can, or avoid it altogether unless you really need it.
Yeah, like the other reply, I'd mostly say testing DAGs was an issue. Airflow-related configuration is easy to get wrong and it silently fails a lot.
Now that I think about it though, most of the time I spent on testing wasn't caused by Airflow. Testing data pipelines just isn't easy with the current well-known tooling.
Probably DAGs - Operators can be tested from their hooks, but in my experience testing a DAG is annoying - I usually just make a copy that does a dry run/runs with test data, or just test in a local Airflow container as it's much faster.
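A minimal sketch of the dry-run idea above, assuming the task logic is kept in plain Python functions so that most of it can be tested without a scheduler. The names `transform_rows`, `load_rows`, and the in-memory `sink` are hypothetical; nothing here is Airflow API.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def transform_rows(rows: List[Row]) -> List[Row]:
    """Drop rows without an id and normalise the name field."""
    cleaned = []
    for row in rows:
        if row.get("id") is None:
            continue  # skip records that arrived without a key
        name = str(row.get("name", "")).strip().lower()
        cleaned.append({"id": row["id"], "name": name})
    return cleaned


def load_rows(rows: List[Row], write: Callable[[Row], None],
              dry_run: bool = False) -> int:
    """Write rows via `write`; in dry-run mode only count what would be written."""
    for row in rows:
        if not dry_run:
            write(row)
    return len(rows)


# Test data standing in for an API response (hypothetical).
sample = [{"id": 1, "name": "  Alice "}, {"id": None, "name": "ghost"}, {"id": 2}]
sink: List[Row] = []  # in-memory stand-in for the warehouse

cleaned = transform_rows(sample)
would_write = load_rows(cleaned, sink.append, dry_run=True)  # touches nothing
written = load_rows(cleaned, sink.append)                    # real write
```

The DAG itself then only wires these functions together, which leaves less to verify inside a running Airflow instance.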
All that I've heard about Airflow is that it's intricately coupled to Python, and can be finicky to maintain as it contains a fair few moving parts.
I've previously used Argo Workflows, which I prefer because I already have a Kubernetes environment, and because it runs containers it's totally language-agnostic, which I think is a huge benefit. It also has a huge number of features for defining and controlling the workflows. The downside is that its configuration/definition YAMLs can get large and a bit messy (as YAMLs are wont to do) - however, templated workflows are coming soon, which should hopefully reduce the noise.
Personally I'm reluctant to use anything that re-implements its own full scheduling system instead of hooking into a pre-existing (and probably more bulletproof) one (i.e. the K8s scheduler), and anything that _requires_ me to write all of my ETL/schedulable code in Python.
Argo implements its own scheduler AFAIK; otherwise, how would it manage dependencies and the execution graph? The part that Argo uses K8s for is orchestration, which Airflow can do as well with the KubernetesPodOperator, but it's not a cloud-native solution and it spins up the whole scheduler and backend for each task.
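To illustrate what "managing dependencies and the execution graph" means in general, here is a small sketch using Python's standard `graphlib` module (Python 3.9+). It is not how Argo or Airflow implement scheduling; the task names and the `execution_batches` helper are made up for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract_users": set(),
    "extract_orders": set(),
    "join": {"extract_users", "extract_orders"},
    "load": {"join"},
}


def execution_batches(graph):
    """Return lists of tasks that could run in parallel, in dependency order."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)  # a real scheduler marks tasks done as they finish
    return batches


batches = execution_batches(dependencies)
```

Deciding which tasks are ready is the dependency-tracking part; actually placing them on machines is what the comment above attributes to K8s.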
We tried to make Polyaxon[0] work with Airflow for machine-learning-specific workflows, but it was very painful, and Airflow does not have good state/artifact management, which leaves users tweaking things themselves. We ended up making a simple abstraction on top of K8s, which was much easier, to provide features for parallel executions, dependencies, failure handling, retries, ... as well as handling ML-specific graphs such as hyperparameter tuning and distributed scheduling.
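As a rough idea of what "failure handling" and "retries" involve at their simplest, a sketch follows. It is not Polyaxon's implementation; `run_with_retries` and `flaky_task` are hypothetical names, and a real system would also persist state between attempts.

```python
import time
from typing import Callable, Tuple


def run_with_retries(task: Callable[[], object], max_attempts: int = 3,
                     delay_seconds: float = 0.0) -> Tuple[str, object, int]:
    """Run `task`, retrying on any exception.

    Returns (status, result_or_last_error, attempts_used).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return ("succeeded", task(), attempt)
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # fixed delay; backoff is a common variant
    return ("failed", last_error, max_attempts)


calls = {"count": 0}


def flaky_task() -> str:
    """Fail twice, then succeed, to simulate a transient error."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "done"


outcome = run_with_retries(flaky_task, max_attempts=3)
```

Returning the status instead of raising lets a caller decide whether downstream tasks should be skipped or still run.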
By the way, Polyaxon looks awesome, I’ve been wanting to try it for a while, but just don’t have any machine learning projects in the pipeline at the moment alas.
Prefect looks great, but parts of it are under the Prefect Community License, which has an exclusion:
" Licensee is not granted the right to, and Licensee shall not, exercise the License for an Excluded Purpose. For purposes of this Agreement, "Excluded Purpose" includes, but is not limited to, using the Software, or any derivative works thereof, to make available any software-as-a-service, platform-as-a-service, infrastructure-as-a-service or other similar service that competes with Prefect products or services."
While I'm interested in using Prefect as part of a SaaS I'm working on, I'm having trouble working out whether it would compete with their offering or not. In my SaaS the Prefect UI will not be exposed; I need an ETL engine "behind the scenes" for parts of the whole workflow (some action sends an event, and on that event a job is triggered).
In theory Prefect SaaS could be used to do the same, so I guess that would mean I'm competing with them?
On the other hand, Flyte looks very young, which could mean it's not mature or is hard to use for non-Lyft use cases.
Hi! Prefect CEO here. We made Prefect freely available for exactly this reason - so you can use Prefect and its UI to ensure that your business's processes are running smoothly. Broadly speaking, your internal use won't violate our license at all (unless your SaaS is a workflow orchestration platform, in which case please check out our open jobs because you're the sort of person we'd love to talk to :) ). If you have any questions at all, always happy to help.
I'm so very wary of workflows-as-a-service: they just seem like a fantastic way to get really locked in to some incredibly specific vendor. Add in the fact that if you're required to keep your data within a geographical jurisdiction, or you have constraints about third parties having access to your data, then you're more or less out of luck.
Also they all seem to charge so very much for what amounts to a fairly straightforward service...
It's OK, but I noticed that my colleagues started cursing a lot more after we started using it. It's still a bit rough around the edges, and there are interesting pitfalls you just learn after failing.
We mainly use the K8sOperator, and the logic is mainly inside independent containers. Therefore our development and testing are not so tightly coupled to Python or Airflow.
Once you get past the learning curve and initial configuration, Airflow is great. We use third-party ETL tools where we can, like Fivetran and Stitch, but Airflow still orchestrates the bulk of our ETL.