by data-ottawa
155 days ago
Airflow and dbt serve a real purpose. The issue is that you can run sub-TiB jobs on a few small or standard instances with better tooling. Spark and Hadoop are for when you need multiple machines. dbt and Airflow let you represent your data as a DAG and operate on that, which is critical if you want to actually maintain and correct data issues and keep your data transforms timely.

Edit: a little surprised at the multiple downvotes. My point is, you can run Airflow and dbt on small instances, and you can do all your data processing on small instances with tools like DuckDB or Polars. But it is very useful to have a tool like dbt that lets you rebuild and manage your data in a clear way, or a tool like Airflow that lets you specify dependencies between runs. After, say, 30 jobs or so, you'll find that being able to re-run all downstreams of a model starts to pay off.
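To make "re-run all downstreams of a model" concrete, here is a rough sketch in plain Python of what that operation amounts to. It is not how dbt or Airflow implement it; the model names and the `DEPENDS_ON` map are made up for illustration, and `downstream_of` is a hypothetical helper. In dbt itself you would get this from graph selectors, e.g. `dbt run --select "stg_orders+"`.

```python
from collections import deque

# Hypothetical dependency map (an assumption for this sketch):
# each model lists the models it reads from.
DEPENDS_ON = {
    "raw_orders": [],
    "raw_customers": [],
    "stg_orders": ["raw_orders"],
    "stg_customers": ["raw_customers"],
    "orders_enriched": ["stg_orders", "stg_customers"],
    "daily_revenue": ["orders_enriched"],
}


def downstream_of(model, depends_on):
    """Return `model` plus everything that depends on it, in a valid run order."""
    # Invert the edges so we can walk from a parent to its children.
    children = {name: [] for name in depends_on}
    for name, parents in depends_on.items():
        for parent in parents:
            children[parent].append(name)

    # Breadth-first walk to collect every model affected by a change to `model`.
    affected = {model}
    queue = deque([model])
    while queue:
        for child in children[queue.popleft()]:
            if child not in affected:
                affected.add(child)
                queue.append(child)

    # Kahn's algorithm restricted to the affected models: a model becomes
    # runnable once all of its affected parents have been scheduled.
    indegree = {n: sum(p in affected for p in depends_on[n]) for n in affected}
    ready = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in children[node]:
            if child in affected:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
    return order


if __name__ == "__main__":
    # Fixing a bug in stg_orders means rebuilding it and everything after it,
    # while stg_customers and the raw tables are left alone.
    print(downstream_of("stg_orders", DEPENDS_ON))
```

With half a dozen models you can keep this in your head; with 30 or more, having the tool compute the affected set and the run order is the part that pays off.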