by geertj
1416 days ago
We tried to set up Airflow in our team in the past. The big problem we encountered is that its unit of management (I believe it's called a "job", but I'm rusty on this) is too low-level. Our pipeline processes a lot of data and we have millions of jobs per day. Once Airflow has a (planned or unplanned) outage, tens of thousands of jobs start piling up, and it never recovers from that. In the end we replaced our data orchestration with a stateless lambda that, for a configured time interval, 1/ looks at what output data is missing, 2/ cross-references that with running jobs (in AWS Batch), and 3/ submits jobs for missing data that has no job. Jobs themselves are essentially stateless. They are never restarted and we don't even look at their status. If one fails, we notice because there will be a hole in the output, and we therefore submit a new one. Some safety precautions are added to prevent a job from repeatedly failing, but that's the exception. Maybe Airflow has moved on since we last tried it, but this was our experience.
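To make the three steps concrete, here is a rough sketch of what such a reconciliation pass could look like. It is not the actual lambda: the function names, the hourly partition keys, and the `attempts` counter used as the "safety precaution" are all assumptions made for illustration. The inputs (existing outputs, running jobs, attempt counts) would be fetched fresh on every invocation, for example from object storage and AWS Batch, so the function itself keeps no state.

```python
from datetime import datetime, timedelta, timezone


def expected_partitions(start, end, step):
    """Return one key per output partition expected in [start, end).

    Hypothetical convention: each partition is named by its start time.
    """
    keys = []
    t = start
    while t < end:
        keys.append(t.strftime("%Y-%m-%dT%H:%M"))
        t += step
    return keys


def plan_submissions(expected, existing_outputs, running_jobs, attempts, max_attempts=3):
    """Decide which partitions need a new job.

    expected         -- ordered partition keys for the configured interval
    existing_outputs -- set of keys whose output data already exists (step 1)
    running_jobs     -- set of keys that currently have a job in flight (step 2)
    attempts         -- dict of key -> jobs already submitted (assumed to be
                        derived externally; guards against endless retries)
    """
    to_submit = []
    for key in expected:
        if key in existing_outputs:
            continue  # output is present, nothing to do
        if key in running_jobs:
            continue  # a job is already working on this hole
        if attempts.get(key, 0) >= max_attempts:
            continue  # keeps failing; leave it for a human to inspect
        to_submit.append(key)  # step 3: missing data with no job
    return to_submit
```

Because the plan is recomputed from observed state on each run, an outage of the scheduler leaves no backlog to drain: the next invocation simply sees the same holes and fills them.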
|
That sounds more like an architecture-at-scale problem than something that is Airflow's 'fault.' Airflow may never have been the right tool for the job but it's getting all the blame.