by kfk
2366 days ago
My team and I tried Airflow but found it didn't fit well with our analytics workflow. For instance, you must rewrite your Jupyter notebook as an Airflow DAG, which means doing basically the same work twice. We use Dask and will soon deploy dask.distributed. I have yet to figure out where Airflow actually fits in the BI/data science architecture.

These activities are usually managed by cron or, more often, by more advanced scheduling tools (depending on the vendor), so scheduling is quite a core part of any architecture that needs to, e.g., load, reload, or refresh data periodically.
If the requirement is simply to connect notebooks to a data lake, then the only scheduling required is loading the data lake, and something like Airflow may be overkill for that, depending on what data is processed and loaded, and how.
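
To make the "cron may be enough" case concrete, here is a minimal sketch of a periodic data lake load that a plain cron entry could drive. Everything in it is an assumption for illustration: the `load_partition` function, the `dt=YYYY-MM-DD` directory layout, the CSV format, and the script path in the crontab comment are hypothetical, not something from the setup discussed above.

```python
import csv
import tempfile
from datetime import date
from pathlib import Path

# Hypothetical crontab entry (path and time are placeholders):
#   0 2 * * * python3 /path/to/load_lake.py


def load_partition(rows, lake_root, run_date):
    """Write one day's rows to <lake_root>/dt=YYYY-MM-DD/part.csv and return the path."""
    partition = Path(lake_root) / f"dt={run_date.isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / "part.csv"
    # Overwrite instead of appending, so re-running the same day is idempotent.
    with target.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "value"])
        writer.writerows(rows)
    return target


if __name__ == "__main__":
    # Placeholder rows; a real job would pull these from the source system.
    sample = [(1, "a"), (2, "b")]
    print(load_partition(sample, tempfile.mkdtemp(), date.today()))
```

If the load grows dependencies between steps, retries, or backfills, that is roughly the point where a scheduler like Airflow starts to earn its keep over a single cron line.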