by jtagliabuetooso 425 days ago
Looking to get feedback for a code-first platform for data: instead of custom frameworks, GUIs, notebooks on a cron, bauplan runs SQL / Python functions from your IDE, in the cloud, backed by your object storage. Everything is versioned and composable: time-travel, git-like branches, scriptable meta-logic.

Perhaps surprisingly, we decided to co-design the abstractions and the runtime, which allowed novel optimizations at the intersection of FaaS and data - e.g. rebuilding functions can be 15x faster than the corresponding AWS stack (https://arxiv.org/pdf/2410.17465). All capabilities are available to humans (CLI) and machines (SDK) through simple APIs.

Would love to hear the community’s thoughts on moving data engineering workflows closer to software abstractions: tables, functions, branches, CI/CD etc.

6 comments

I am very interested in this but have some questions after a quick look

It mentions "Serverless pipelines. Run fast, stateless Python functions in the cloud." on the home page... but it took me a while of clicking around looking for exactly what the deployment model is

e.g. is it the cloud provider's own "serverless functions"? or is this a platform that maybe runs on k8s and provides its own serverless compute resources?

Under examples I found https://docs.bauplanlabs.com/en/latest/examples/data_product... which shows running a cli command `serverless deploy` to deploy an AWS Lambda

for me deploying to regular Lambda func is a plus, but this example raises more questions...

https://docs.bauplanlabs.com/en/latest/commands_cheatsheet.h... doesn't show any 'serverless' or 'deploy' command... presumably the example is using an external tool i.e. the Serverless framework?

which is fine, great even - I can presumably use my existing code deployment methodology like CDK or Terraform instead

Just suggesting that the underlying details could be spelled out a bit more up front.

In the end I kind of understand it as similar to sqlmesh, but with a "BYO compute" approach? So where sqlmesh wants to run on a Data Warehouse platform that provides compute, and only really supports Iceberg via Trino, bauplan is focused solely on Iceberg and defining/providing your own compute resources?

I like it

Last question is re here https://docs.bauplanlabs.com/en/latest/tutorial/index.html

> "Need credentials? Fill out this form to get started"

Should I understand therefore that this is only usable with an account from bauplanlabs.com ?

What does that provide? There's no pricing mentioned so far - what is the model?

> or is this a platform that maybe runs on k8s and provides its own serverless compute resources?

This one, although it’s a custom orchestration system, not Kubernetes. (there are some similarities but our system is really optimized for data workloads)

We manage Iceberg for easy data versioning, take care of data caching and Python modules, etc., and you just write some Python and SQL and exec it over your data catalog without having to worry about Docker and all the infra stuff.

I wrote a bit on what the efficient SQL half takes care of for you here: https://www.bauplanlabs.com/blog/blending-duckdb-and-iceberg...

> In the end I kind of understand it as similar to sqlmesh, but with a "BYO compute" approach? So where sqlmesh wants to run on a Data Warehouse platform that provides compute, and only really supports Iceberg via Trino, bauplan is focused solely on Iceberg and defining/providing your own compute resources?

Philosophically, yes. In practice so far we manage the machines in separate AWS accounts _for_ the customers, in a sort of hybrid approach, but the idea is not dissimilar.

> Should I understand therefore that this is only usable with an account from bauplanlabs.com ?

Yep. We’d help you get started and use our demo team. Send jacopo.tagliabue@bauplanlabs.com an email

RE: pricing. Good question. We're at an early startup stage, so pricing is bespoke at the moment. Contact your friendly neighborhood Bauplan founder to learn more :)

So there's no self-hosted option?

I think currently the docs are lacking some context if you arrive there via a link rather than via your SaaS home page

Thanks a lot for the feedback, point taken!

Wrt deployment: the system has a control plane (on Bauplan's AWS; it never sees any data, just auth and metadata), and data planes for customers (single tenant, private link, SOC 2 compliant and all that).

If by hosting you mean "move the data plane to my cloud", that is entirely possible but not as recommended as the managed offering: in the end, the only dependencies we have are off-the-shelf VMs in which we install our binary - and your bucket of course, but that is yours.

If you mean "installing the control plane on my cloud", that is not in the cards at the moment, unless a very special deployment is needed.

My suggestion - before complex deployment discussion - is always super simple: try it for free on public datasets and decide if you like it; running the quick start takes three minutes, just send over your email for access.

If you do like it, we can have a discussion on deployment, which has never been a blocker before.

so like the Lambda funcs in the examples - do I deploy those myself to my own infra? or they have to be defined using Serverless framework and get deployed to Bauplan-controlled infra? are they in the control plane or the data plane?

Just trying to understand how it all fits together

Sorry for the confusing example.

So, the AWS lambda in the data product example is a bit of a red herring, and it's used as the outer process to create branches and launch bauplan pipelines through the Python client (https://github.com/BauplanLabs/data-products-with-bauplan/bl...).

It can be your laptop, an Airflow task, a prefect flow or a step function or a cron job on a VM - it's the "host" process (for the data product we picked lambda because it's the easiest way for people to "run small Python stuff every 5 minutes" - this is a prefect example: https://www.prefect.io/blog/prefect-on-the-lakehouse-write-a...).

When you interact with the Bauplan lakehouse, all the compute happens on Bauplan, nothing happens in the lambda: think of launching a Snowflake query from a lambda - the client is in the lambda but all the work is done in the SF cloud. Unlike many (all?) other lakehouses, Bauplan is code-first, so you can program the entire branching and merging patterns with a few lines of code, offloading the runtime to the platform.

The platform itself runs on standard EC2, which contains the dockerized functions needed for execution - typically we manage EC2 in a single-tenant, private link, SOC 2 compliant account we own for simplicity, but nothing prevents the VMs from being somewhere else (given connectivity is ok etc.). It is our philosophy that you should not worry about the infra part of it, so even in case of BYOC we will be in charge of managing that.
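
To make the "program the branching and merging patterns from a host process" idea concrete, here is a minimal sketch of a write-audit-publish loop. It assumes a toy in-memory catalog: `ToyLakehouse`, its methods, and `write_audit_publish` are hypothetical names made up for illustration and are not Bauplan's client API. In the real setup the host process (a lambda, an Airflow task, a laptop) would only issue calls like these, while the work runs on the platform.

```python
from copy import deepcopy


class ToyLakehouse:
    """In-memory stand-in for a branch-aware catalog (hypothetical, not Bauplan's API)."""

    def __init__(self):
        # branch name -> {table name -> list of rows}
        self.branches = {"main": {}}

    def create_branch(self, name, from_ref="main"):
        self.branches[name] = deepcopy(self.branches[from_ref])

    def write_table(self, branch, table, rows):
        self.branches[branch][table] = list(rows)

    def read_table(self, branch, table):
        return self.branches[branch][table]

    def merge(self, source, into="main"):
        self.branches[into].update(self.branches[source])

    def delete_branch(self, name):
        del self.branches[name]


def write_audit_publish(lake, table, rows, audit):
    """Stage rows on a temporary branch, audit them, then merge or discard."""
    branch = f"ingest_{table}"
    lake.create_branch(branch)
    lake.write_table(branch, table, rows)
    ok = audit(lake.read_table(branch, table))
    if ok:
        lake.merge(branch, into="main")
    lake.delete_branch(branch)
    return ok


def no_null_ids(rows):
    """Example data-quality check: every row must carry an id."""
    return all(r.get("id") is not None for r in rows)


lake = ToyLakehouse()
good = write_audit_publish(lake, "orders", [{"id": 1}, {"id": 2}], no_null_ids)
bad = write_audit_publish(lake, "events", [{"id": None}], no_null_ids)
print(good, bad, sorted(lake.branches["main"]))
```

The failed audit never touches `main`, which is the point of staging on a branch: the host process decides what gets published, and bad data is dropped together with its branch.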

Does it help clarify the mental model?

It is a service, not an open source tool, as far as I can tell. Do you intend to stay that way? What is the business model and pricing?

I am a bit concerned that you want users to swap out both their storage and workflow orchestrator. It's hard enough to convince users to drop one.

How does it compare to DuckDB or Polars for medium data?

- Yes, it is a service and at least the runner will stay like that for the time being.

- We are not quite live yet, but the pricing model is based on compute capacity and is divided into tiers (e.g. small = 50GB for concurrent scans = $1500/month; large can get up to a TB). Infinite queries, infinite jobs, infinite users. The idea is to have very clear pricing with no sudden increases due to volume.

- You do not have to swap your storage - our runner comes to your S3 bucket and your data never has to be anywhere other than your S3.

- You do not have to swap your orchestrator either. Most of our clients are actually using it with their existing orchestrator. You call the platform's APIs, including run, from your Airflow/Prefect/Temporal tasks: https://www.prefect.io/blog/prefect-on-the-lakehouse-write-a...

Does it help?

Yep, staying service.

RE: workflow orchestrators. You can use the Bauplan SDK to query, launch jobs and get results from within your existing platform. We don’t want to replace it entirely if that doesn’t fit for you, just to augment it.

RE: DuckDB and Polars. It literally uses DuckDB under the hood but with two huge upgrades: one, we plug into your data catalog for really efficient scanning even on massive data lake houses, before it hits the DuckDB step. Two, we do efficient data caching. Query results and intermediate scans and stuff can be reused across runs.

More details here: https://www.bauplanlabs.com/blog/blending-duckdb-and-iceberg...
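
As a rough illustration of how results can be reused across runs, here is a minimal sketch of a cache keyed on the function's code, the input snapshot and the parameters. Everything in it (`fingerprint`, `ScanCache`, `scan_orders`, the `snap-001` style ids) is a hypothetical stand-in; the actual caching strategy is described in the linked blog post and is not reproduced here.

```python
import hashlib


def fingerprint(func, snapshot_id, params):
    """Hash the function bytecode, the table snapshot and the call parameters."""
    h = hashlib.sha256()
    h.update(func.__code__.co_code)
    h.update(repr(func.__code__.co_consts).encode())
    h.update(snapshot_id.encode())
    h.update(repr(sorted(params.items())).encode())
    return h.hexdigest()


class ScanCache:
    """Reuse a result when code, snapshot and parameters are all unchanged."""

    def __init__(self):
        self._results = {}
        self.hits = 0
        self.misses = 0

    def run(self, func, snapshot_id, **params):
        key = fingerprint(func, snapshot_id, params)
        if key in self._results:
            self.hits += 1
            return self._results[key]
        self.misses += 1
        result = func(snapshot_id, **params)
        self._results[key] = result
        return result


def scan_orders(snapshot_id, min_amount):
    # Placeholder for an expensive scan over a table snapshot.
    rows = [{"id": 1, "amount": 5}, {"id": 2, "amount": 50}]
    return [r for r in rows if r["amount"] >= min_amount]


cache = ScanCache()
first = cache.run(scan_orders, "snap-001", min_amount=10)
second = cache.run(scan_orders, "snap-001", min_amount=10)  # same inputs: cache hit
third = cache.run(scan_orders, "snap-002", min_amount=10)   # new snapshot: cache miss
print(cache.hits, cache.misses)
```

Keying on an immutable snapshot id is what makes this safe: a versioned table never changes under a given id, so a hit can be served without rescanning.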

As for Polars, you can use Polars itself within your Python models easily by specifying it in a pip decorator. We install all requested packages within Python modules.

In what kinds of workloads or usage patterns do you see the biggest performance gains vs traditional FaaS + storage stacks?

In a nutshell, data and AI workloads require fast re-building and vertical scaling:

1) you should not need to redeploy a Lambda if you're running January and February vs only January now. In the same vein, you should not need to redeploy a Lambda if you upgrade from pandas to polars: rebuilding functions is 15x faster than Lambda and 7x faster than Snowpark (-> https://arxiv.org/pdf/2410.17465)

2) the only way (even in popular orchestrators, e.g. Airflow, not just FaaS) to pass data around in DAGs is through object storage, which is slow and costly: we use Arrow as intermediate data format and over the wire, with a bunch of optimizations in caching and zero-copy sharing to make the development loop extra-fast, and the usage of compute efficient!
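
The difference between a copy-based hand-off and a zero-copy one in point 2 can be sketched with the standard library alone. This is not Arrow and not the platform's implementation: `array` and `memoryview` merely stand in for a columnar buffer and a shared view of it, and the pickle round trip stands in for writing to and reading from object storage.

```python
import pickle
from array import array

# Upstream step: a column of 64-bit integers in one contiguous buffer.
column = array("q", range(100_000))

# Copy-based hand-off: serialize, then deserialize into a brand new buffer.
copied = pickle.loads(pickle.dumps(column))

# Zero-copy hand-off: the downstream step gets a view over the same memory.
view = memoryview(column)
first_half = view[:50_000]  # slicing a memoryview does not copy the data

# Mutating the source shows which hand-off shares memory with it.
column[0] = 42
print(copied[0], first_half[0])
```

The copy still holds the old value while the view sees the update, because only the view points at the original buffer. Avoiding that serialize/deserialize round trip between DAG steps is where the time savings come from.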

Our current customers run near real-time analytics pipelines (Kafka -> S3 / Iceberg -> Bauplan run -> Bauplan query), DS / AI workloads and WAP for data ingestion.

I have really enjoyed the conversations I have had with Jacopo and Ciro over the years. They have really revisited a lot of assumptions behind commonly used tools/infrastructure in the data space and built something that really has a much better developer experience.

So excited to see them take this step!

Thanks @sbpayne <3

How does this compare to dbt? Seems like it can do the same?

Some similarities, but Bauplan offers:

1. Great Python support. Piping something from a structured data catalog into Python is trivial, and so is persisting results. With materialization, you never need to recompute something in Python twice if you don’t want to — you can store it in your data catalog forever.

Also, you can request any Python package you want, and even have different Python versions and packages in different workflow steps.

2. Catalog integration. Safely make changes and run experiments in branches.

3. Efficient caching and data re-use. We do a ton of tricks behind the scenes to avoid recomputing or rescanning things that have already been done, and pass data between steps with Arrow zero-copy tables. This means your DAGs run a lot faster because the amount of time spent shuffling bytes around is minimal.

To me they seem like the pythonic version of dbt! Instead of yaml, you write Python code. That, and a lot of on-the-fly computations to generate an optimized workflow plan.

Plenty of stuff in common with dbt's philosophy. One big thing though: dbt does not run your compute or manage your lake. It orchestrates your code and pushes it down to a runtime (e.g. 90% of the time Snowflake).

This IS a runtime.

You import bauplan, write your functions and run them straight in the cloud - you don't need anything more. When you want to make a pipeline you chain the functions together, and the system manages the dependencies, the containerization, the runtime, and gives you git-like abstractions over runs, tables and pipelines.
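
One way to picture "chain the functions together and let the system manage the dependencies" is a registry that reads each function's parameter names as its upstream steps. The sketch below assumes that convention; the `model` decorator, `REGISTRY` and `run_pipeline` are hypothetical names for illustration and not the bauplan SDK. The same pass also rejects a DAG that references a missing step before anything runs.

```python
import inspect
from graphlib import TopologicalSorter

REGISTRY = {}


def model(func):
    """Register a function as a pipeline step (hypothetical decorator)."""
    REGISTRY[func.__name__] = func
    return func


@model
def raw_orders():
    return [{"id": 1, "amount": 20}, {"id": 2, "amount": 80}]


@model
def big_orders(raw_orders):
    return [r for r in raw_orders if r["amount"] > 50]


@model
def order_count(big_orders):
    return len(big_orders)


def run_pipeline(registry):
    """Infer the DAG from parameter names, validate it, run it in order."""
    graph = {
        name: set(inspect.signature(func).parameters)
        for name, func in registry.items()
    }
    missing = {p for parents in graph.values() for p in parents} - set(registry)
    if missing:
        raise ValueError(f"unknown upstream steps: {sorted(missing)}")
    # static_order() yields parents before children and raises on cycles.
    order = list(TopologicalSorter(graph).static_order())
    results = {}
    for name in order:
        inputs = {parent: results[parent] for parent in graph[name]}
        results[name] = registry[name](**inputs)
    return order, results


order, results = run_pipeline(REGISTRY)
print(order, results["order_count"])
```

Because the graph is known before execution, a broken reference fails fast instead of halfway through a run.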

I see, this is a great answer. So you don't need any platform or spark or anything. Just storage and compute?

You technically just need storage (files in a bucket you own and control forever).

We bring you the compute as ephemeral functions, vertically integrated with your S3: table management, containerization, read / write optimizations, permissions etc. are all done by the platform, plus obvious (at least to us ;-)) stuff like preventing you from running a DAG that is syntactically incorrect etc.

Since we manage your code (compute) and data (lake state through git for data), we can also provide full auditing with one liners: e.g. "which specific run changed this specific table on this data branch? -> bauplan commit ..."
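
The auditing question above reduces to a lookup over a commit log once every commit records the run that produced it. A minimal sketch, assuming a hypothetical `Commit` record and made-up run ids; this is not the output format or API of the `bauplan commit` command:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """One entry in a data-branch history (hypothetical schema)."""
    commit_id: str
    branch: str
    run_id: str
    tables_changed: tuple


def runs_that_changed(log, table, branch):
    """Return the runs that modified `table` on `branch`, oldest first."""
    return [
        c.run_id
        for c in log
        if c.branch == branch and table in c.tables_changed
    ]


log = [
    Commit("c1", "main", "run-101", ("orders",)),
    Commit("c2", "dev", "run-102", ("orders", "customers")),
    Commit("c3", "main", "run-103", ("customers",)),
    Commit("c4", "main", "run-104", ("orders",)),
]

print(runs_that_changed(log, "orders", "main"))
```

The lookup is only this simple because code and data are versioned together: each table change is tied to exactly one run on exactly one branch.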

the big question i have is: where is the code executed? “the cloud”? whose cloud? my cloud? your environment on AWS?

the paper briefly mentions “bring your own cloud” in 4.5 but the docs page doesn’t seem to have any information on doing that (or at least none that i can find).

The code you execute on your data currently runs in a per-customer AWS account managed by us. We leave the door open for BYOC based on the architecture we’ve designed, but due to lean startup life, that’s not an option yet. We’d definitely be down to chat about it