by jakozaur 923 days ago
It's a cool idea, but it looks incomplete for production use.

1. Usually, you want a warehouse running all the time: bring data in through ETL, run transformations, and report. This goes against the local environment. Ideally, I would love a cloud warehouse that each engineer could easily fork to their laptop.

2. Almost all companies already have some data setup. The migration path is very unclear. Most likely, this would be a secondary system for the majority of companies. Ideally, I would love a description of how to use it alongside big platforms (e.g. BigQuery or Snowflake).


I could add more, but here is my experience with a newish local stack I was trying: I spent a good number of hours on duckdb this week processing personal data from data dumps (social networks, etc.), and now I'm back to the combo of postgresql in containers + sqlite.

After the initial imports and some massaging with queries, which felt awesome, I found it hard to step up my game and build the relationships I wanted. The last straw before switching was that I could not manage foreign keys without recreating entire tables. I can go over other examples.

It can be done, but it takes you out of the flow when you're analyzing the data and cleaning it, especially because I know that I can do it with psql and sqlite in the blink of an eye.

Since many ETL tools don't mind the target database being one of these old and trusty fellas, I felt I was losing a lot of time just to get rid of a postgres install that is right now consuming only 200 MB of RAM in a docker/podman container. Or working around some sqlite ingestion issues with simple notebooks + pandas/polars/etc.

From my POV, it seems like shaky ground for an entirely new stack.
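
For illustration, a minimal sketch of the SQLite side of that workflow: adding a foreign-key relationship to a table that already holds data, without rebuilding it. The `users` and `posts` tables and their rows are hypothetical stand-ins for an imported dump. In PostgreSQL the equivalent would be `ALTER TABLE ... ADD CONSTRAINT ... FOREIGN KEY`; SQLite only supports this through `ADD COLUMN ... REFERENCES`, which is what is shown here.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default

# Hypothetical tables standing in for an imported data dump.
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, handle TEXT)")
con.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")
con.execute("INSERT INTO users VALUES (1, 'alice')")
con.execute("INSERT INTO posts VALUES (10, 'hello')")

# Add the relationship afterwards; existing rows get NULL in the new column.
con.execute("ALTER TABLE posts ADD COLUMN user_id INTEGER REFERENCES users(id)")
con.execute("UPDATE posts SET user_id = 1 WHERE id = 10")


def insert_post(post_id, body, user_id):
    """Return True if the row was accepted, False if the foreign key rejected it."""
    try:
        con.execute("INSERT INTO posts VALUES (?, ?, ?)", (post_id, body, user_id))
        return True
    except sqlite3.IntegrityError:
        return False
```

Calling `insert_post(12, "orphan", 999)` is rejected because no user with id 999 exists, while the table itself was never recreated.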

I appreciate duckdb taking me out of the comfort zone tho.

Hi, Aleks here, one of the authors, and thank you very much for your comment.

We have been running this stack in production for the last few months, and it has its downsides (I would argue due to the young ecosystem) and upsides, which we try to explain in the blog. We wanted to concentrate more on the concepts and the technology changes/improvements that allow us to run such a stack, and to explain how we see the next steps forward.

1. Running a warehouse is not a bad idea, but you must always be careful to separate storage from compute in order to scale. I experienced the limitations of such a system, as described in this blog https://delta.io/blog/2022-09-14-why-migrate-lakehouse-delta... -> it is a tough position if your solution is good but does not scale. Ideally, run it with external tables so that the data is accessible without going through the engine.

Another limitation is the metastore for your tables and metadata, which you usually have per workspace/environment in such a scenario. Databricks' Unity Catalog is an excellent way to solve it, but it is only compatible with some engines.

2. We do not think that this stack exists to replace Snowflake or BigQuery, but to take part of the workload away (data transformation) and let PaaS solutions be good at what they are made for -> user interface and interaction.

Hi,

Georg, one of the authors, here.

What we argue is that:

- for a great software/data engineering/creation experience we recommend such a stack that is only on when needed (when transformations occur)

- for a great data consumption experience we suggest the integration with an established PaaS platform. Not only for the sake of being available (as a serving layer of data to end users) but also for the missing fine-grained RBAC in the proposed transformation layer

Definite[0] is the cloud version of this idea (data stack in a box). We have ETL, modeling, a data warehouse (Snowflake), and BI (dashboards) in one app.

We're experimenting with using DuckDB as the warehouse. Would be awesome to let people pull down parts of their warehouse locally for testing.

0 - https://www.definite.app/

Isn't that case easily handled by something like a large stateless VM and cloud storage, so EC2 + S3? It doesn't have to be local; the point is that it doesn't have to be distributed either, just one large instance that is only on when it's needed.

What stops you from running this in a container and dumping the results with a script into BigQuery or Snowflake, to be queried by a reporting solution?

Hi,

Georg, one of the authors, here.

In fact, this is exactly what we argue: to keep the data consumer experience high, a PaaS platform (Fabric, BigLake, Databricks, SF, ...) can make a lot of sense, whereas for the best data development/creator experience, a high-code solution based on solid software engineering best practices should be preferred, at least in my/our (authors') opinion.

I already see companies with BigQuery set up.

I would like to quickly grab some data, queries and dashboards and run them in my local warehouses.

If you use BigQuery in BigLake mode, i.e. with Parquet, Iceberg, or Delta files on GCS (or another object store), you can easily pull the data into DuckDB as well.
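
A rough sketch of what that pull could look like for the Parquet case, using the `duckdb` Python package and DuckDB's `httpfs` extension. The helper names, the table name, and the bucket path are hypothetical. It assumes GCS credentials are already configured for DuckDB (private buckets will not be readable otherwise), and it does not cover Iceberg or Delta, which need their own extensions.

```python
def parquet_import_sql(table: str, gcs_glob: str) -> str:
    """Build a statement that copies remote Parquet files into a local DuckDB table.

    Both arguments are interpolated directly, so pass trusted values only.
    """
    return (
        f"CREATE OR REPLACE TABLE {table} AS "
        f"SELECT * FROM read_parquet('{gcs_glob}')"
    )


def pull_into_duckdb(db_path: str, table: str, gcs_glob: str) -> None:
    """Materialize the Parquet files behind gcs_glob into a local DuckDB file."""
    import duckdb  # third-party package: pip install duckdb

    con = duckdb.connect(db_path)
    try:
        con.execute("INSTALL httpfs")  # remote filesystem support (s3://, gs://, https://)
        con.execute("LOAD httpfs")
        con.execute(parquet_import_sql(table, gcs_glob))
    finally:
        con.close()
```

Something like `pull_into_duckdb("local.duckdb", "events", "gs://my-bucket/events/*.parquet")` would then leave a local copy that can be queried offline.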