by zlurker
1003 days ago
We orchestrate our ETL pipelines with Dagster. We only use DuckDB in a few of them, but we're slowly replacing our pandas ETLs with it. For some of our bigger jobs we use Spark instead. Essentially it's:
1. Data comes from sources such as S3, SFTP, and RDS.
2. Use DuckDB to load most of these using only extensions (I don't believe there's one for SFTP, so we just have some Python code to pull the files out).
3. Transform the data however we'd like with DuckDB.
4. Convert the DuckDB table to PyArrow.
5. Save to S3 with delta-rs.

FWIW, we also have this all execute externally from our orchestration on an EC2 instance. This allows us to scale vertically.
Last time I checked, DuckDB didn't have the concept of a metastore, so do you have an internal convention for table locations and folder structure?
What do you use for reports/visualizations? Notebooks?