

by biellls 1693 days ago

Ordinarily I would not recommend Airflow over a script for a one-off like this, and you are right that it could be done with much less ceremony. I do, however, think it was a good choice in this case, given the sheer amount of data downloaded and how long the download took, because Airflow gives you the following advantages:

- Parallelization: this may be the most minor advantage in the list, but you don't need to set up thread pools to parallelize your work since the scheduler parallelizes tasks.

- Observability: you can easily see which tasks failed and look at their logs.

- Reliability and task isolation: you can set up automatic retries for tasks that fail, and a failure won't affect the rest of the flow. You can easily relaunch the failed tasks independently without restarting the whole load.
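
For contrast, here is a rough sketch of what a plain script would have to hand-roll to get parallelism, retries, and failure isolation on its own. It uses only the standard library; the names `run_with_retries` and `run_all` are made up for illustration and do not come from the article or from Airflow.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_retries(task, arg, retries=3, delay=0.0):
    """Call task(arg), retrying up to `retries` extra times on any exception."""
    for attempt in range(retries + 1):
        try:
            return task(arg)
        except Exception:
            if attempt == retries:
                raise  # out of retries: let the caller record the failure
            time.sleep(delay)


def run_all(task, args, max_workers=4, retries=3):
    """Run task over args in parallel; one failure does not stop the others."""
    succeeded, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_with_retries, task, a, retries): a for a in args}
        for fut in as_completed(futures):
            arg = futures[fut]
            try:
                succeeded[arg] = fut.result()
            except Exception as exc:
                failed[arg] = exc  # kept so only these can be relaunched later
    return succeeded, failed
```

Calling `run_all(download, days)` returns the results plus a dict of the arguments that still failed after retries, so only those need to be rerun. Per-task logs and a UI for inspecting them would still have to be built separately, which is the part Airflow provides out of the box.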

1 comment

aelzeiny 1693 days ago

All good choices, but it's also worth noting that the structure of the author's DAG is strange. The way this is set up, you would have to change Python code to run (or rerun) this task for different days.

In canonical Airflow, the job would be one DAG, and each day would be a separate DAG run. Then you would backfill all the days that you would like the job to run. If there's some sort of max-concurrency requirement, that would be handled by setting the `max_active_runs` parameter or by using Airflow's pool concept.
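
A minimal sketch of that layout, assuming Airflow 2.x (where `PythonOperator` passes context values such as `ds` to the callable automatically). The DAG id, the dates, the concurrency limit, and the body of `download_day` are placeholders of mine, not the author's code.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def download_day(ds, **_):
    # `ds` is the run's logical date as "YYYY-MM-DD", injected by Airflow.
    # Hypothetical placeholder: the real download logic would go here.
    print(f"downloading data for {ds}")


with DAG(
    dag_id="daily_download",          # hypothetical name
    start_date=datetime(2020, 1, 1),  # hypothetical start of the date range
    schedule_interval="@daily",       # one DAG run per day (newer 2.x releases prefer `schedule`)
    catchup=False,                    # past days are loaded by an explicit backfill instead
    max_active_runs=4,                # at most 4 daily runs in flight at once
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    PythonOperator(task_id="download_day", python_callable=download_day)
```

With a DAG like this, loading a range of days becomes a CLI call rather than a code change, for example `airflow dags backfill --start-date 2020-01-01 --end-date 2020-01-31 daily_download` in Airflow 2.x, and rerunning a single day means clearing or relaunching just that DAG run.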

If I had to venture a guess, the author is not an experienced Airflow user, and just wanted to give a new technology an honest try.