| HN Mirror


by data-ottawa 155 days ago

Airflow and dbt serve a real purpose.

The issue is that you can run sub-TiB jobs on a few small or standard instances with better tooling. Spark and Hadoop are for when you need multiple machines.

dbt and airflow let you represent your data as a DAG and operate on that, which is critical if you want to actually maintain your data, correct issues, and keep your transforms timely.

edit: a little surprised at multiple downvotes. My point is, you can run airflow and dbt on small instances, and you can do all your data processing on small instances with tools like duckdb or polars.

But it is very useful to use a tool like dbt that allows you to re-build and manage your data in a clear way, or a tool like airflow which lets you specify dependencies for runs.

After, say, 30 jobs or so, you'll find that being able to re-run all downstreams of a model starts to pay off.
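To make "re-run all downstreams of a model" concrete, here is a rough sketch in plain Python using only the standard library. The model names and the `DEPS` mapping are made up for illustration; this is not how dbt or airflow implement it, just the underlying idea (dbt exposes something similar through its `model+` selector).

```python
from graphlib import TopologicalSorter

# Hypothetical models: each name maps to the set of models it depends on.
DEPS = {
    "raw_orders": set(),
    "raw_users": set(),
    "stg_orders": {"raw_orders"},
    "stg_users": {"raw_users"},
    "orders_enriched": {"stg_orders", "stg_users"},
    "daily_revenue": {"orders_enriched"},
}


def downstream_of(model, deps):
    """Return `model` plus everything that depends on it, in a valid run order."""
    affected = {model}
    changed = True
    # Keep sweeping until no new dependents are found.
    while changed:
        changed = False
        for node, parents in deps.items():
            if node not in affected and parents & affected:
                affected.add(node)
                changed = True
    # static_order() yields every model after all of its dependencies.
    order = TopologicalSorter(deps).static_order()
    return [m for m in order if m in affected]


# Fixing stg_users means rebuilding it and its two dependents, nothing else.
print(downstream_of("stg_users", DEPS))
```

With six models you could do this by hand; at 30 or more, having the tool walk the graph for you is the part that pays off.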

2 comments

adammarples 155 days ago

Agreed, airflow and dbt have literally nothing to do with the size of the data and can be useful, or overkill, at any size. Dbt just templates the query strings we use to query the data, and airflow just schedules when we query the data and what we do next. The fact that you can fit the whole dataset in duckdb without issue is kind of separate from these tools; we still need to be organised about how and when we query it.


x0x0 155 days ago

dbt is super useful for building a dag and managing pieces of it that update on different schedules. e.g. with one dataset that's refreshed monthly and another daily, you can rebuild only the daily one unless the slower-cadence input has a new update.
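The monthly/daily case boils down to a freshness check per model. A minimal sketch in plain Python, assuming you track a last-built timestamp per model and a last-updated timestamp per input; the function name, the source names, and the dates are hypothetical, and dbt's own mechanism for this is not shown here.

```python
from datetime import datetime


def needs_rebuild(last_built, input_updated_at):
    """True if the model was never built or any input changed since the last build."""
    if last_built is None:
        return True
    return any(ts > last_built for ts in input_updated_at.values())


# Hypothetical state: the monthly source last landed on Jan 1,
# the daily source landed this morning (Jan 15).
monthly_inputs = {"monthly_src": datetime(2024, 1, 1)}
daily_inputs = {"daily_src": datetime(2024, 1, 15)}

# The monthly model was built on Jan 2, after its only input landed: skip it.
rebuild_monthly = needs_rebuild(datetime(2024, 1, 2), monthly_inputs)

# The daily model was built on Jan 14, before today's data arrived: rebuild it.
rebuild_daily = needs_rebuild(datetime(2024, 1, 14), daily_inputs)
```

When the next monthly drop arrives, the same check flips to true for the monthly model, and its downstreams get rebuilt along with it.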
