| A friend of mine wanted an ETL pipeline set up (SQL Server to BQ for analysis and dashboarding), and I ended up stumbling across Airflow. I spun up two VMs on GCP, one for Airflow and the other for the Postgres DB that stores the metadata. - One thing I've noticed is that Airflow generates a ton of logs that will fill up your disk quite fast. I started with 100GB and I'm now at 500GB. Granted, disk space isn't expensive, but even with a few DAGs I'm surprised at how quickly it fills. Apparently you need a DAG to run to clear those logs, but I was too lazy, so I just purge the logs with a cron job. - The SQL Server Operator is buggy. I filed an issue with the Airflow team, but I had to do some hacky stuff to get it to work. - Even with a few DAGs, Airflow will spike the CPU utilization of the VM to 100% for X minutes (in my case about 15 minutes), which is quite interesting. My tasks basically query SQL Server -> dump to CSV (stored on GCS) -> import to BQ. - My DAGs execute every hour, and if Airflow is down for X hours and I resolve the issue, it will try to run all the tasks for the hours it was down, which isn't ideal because it will take hours to catch up. So I've had to delete tasks and only run the most recent ones. Granted, my setup is pretty simple and YMMV, but Airflow has done what it needs to do, albeit with some pain. |
Have you checked why the CPU spikes? Airflow re-imports the DAG files every few seconds. We had an issue where it didn't honor the .airflowignore file, which made it execute our tests every few seconds. The easy solution was to put them into the .dockerignore.
You might also have too much logic at the top level of your DAG files. It's recommended to avoid even imports at the top level, so that parsing the files is faster.
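To make the top-level point concrete, here is a minimal sketch in plain Python. The function name `dump_rows_to_csv` is hypothetical, and `csv` is only a stand-in for whatever heavy dependency (a database driver, pandas) your tasks really use. In an actual DAG file this would be the callable you hand to an operator; I've left Airflow itself out so the snippet runs on its own.

```python
def dump_rows_to_csv(rows, path):
    """Hypothetical task callable: write rows to a CSV file and return its path."""
    # Deferred import: this line only runs when the task executes,
    # not every time the scheduler re-parses the file.
    # csv is a stand-in here for a heavier dependency.
    import csv

    with open(path, "w", newline="") as fh:
        csv.writer(fh).writerows(rows)
    return path
```

The module itself then contains nothing but definitions, so parsing it costs almost nothing no matter how often it happens.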
Not saying it's not an odd tool though.
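On the catch-up problem from the parent comment: as far as I know, that behaviour is controlled by the `catchup` argument on the DAG, and setting `catchup=False` should make it schedule only the latest interval after an outage. Below is a rough model of the difference in plain Python. The `missed_runs` function is my own simplification for illustration, not Airflow's scheduler logic.

```python
from datetime import datetime, timedelta


def missed_runs(last_run, now, interval=timedelta(hours=1), catchup=True):
    """Simplified model: schedule times that came due while the scheduler was down."""
    due = []
    t = last_run + interval
    while t <= now:
        due.append(t)
        t += interval
    # With catch-up, every missed interval gets a run; without it, only the latest.
    return due if catchup else due[-1:]


last_ok = datetime(2024, 1, 1, 0, 0)
back_up = datetime(2024, 1, 1, 6, 30)
backlog = missed_runs(last_ok, back_up)                  # six hourly runs queued
latest = missed_runs(last_ok, back_up, catchup=False)    # only the 06:00 run
```

With an hourly schedule and a six-and-a-half-hour outage, the first call queues six runs, which is the backlog you've been deleting by hand.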