Hacker News new | ask | show | jobs
by korijn 425 days ago
How does this compare to dbt? Seems like it can do the same?
2 comments

Some similarities, but Bauplan offers:

1. Great Python support. Piping something from a structured data catalog into Python is trivial, and so is persisting results. With materialization, you never need to recompute something in Python twice if you don’t want to — you can store it in your data catalog forever.

Also, you can request anything Python package you want, and even have different Python versions and packages in different workflow steps.

2. Catalog integration. Safely make changes and run experiments in branches.

3. Efficient caching and data re-use. We do a ton of tricks behind to scenes to avoid recomputing or rescanning things that have already been done, and pass data between steps with Arrow zero copy tables. This means your DAGs run a lot faster because the amount of time spent shuffling bytes around is minimal.

To me they seem like the pythonic version of dbt! Instead of yaml, you write Python code. That, and a lot of on-the-fly computations to generate an optimized workflow plan.
Plenty of stuff in common with dbt's philosophy. One big thing though, dbt does not run your compute or manage your lake. It orchestrate your code and pushes it down to a runtime (e.g. 90% of the time Snowflake).

This IS a runtime.

You import bauplan, write your functions and run them in straight into the cloud - you don't need anything more. When you want to make a pipeline you chain the functions together, and the system manages the dependencies, the containerization, the runtime, and gives you a git-like abstractions over runs, tables and pipelines.

I see, this is a great answer. So you don't need any platform or spark or anything. Just storage and compute?
You technically just need storage (files in a bucket you own and control forever).

We bring you the compute as ephemeral functions, vertically integrated with your S3: table management, containerization, read / write optimizations, permissions etc. is all done by the platform, plus obvious (at least to us ;-)) stuff like preventing you to run a DAG that is syntactically incorrect etc.

Since we manage your code (compute) and data (lake state through git for data), we can also provide full auditing with one liners: e.g. "which specific run change this specific table on this data branch? -> bauplan commit ..."