Hacker News new | ask | show | jobs
by mind-blight 637 days ago
I've been using duckdb to import data into postgres (especially CSVs and JSON) and it has been really effective.

Duckdb can run SQL across the different data formats and insert or update directly into postgres. I run duckdb with python and Prefect for batch jobs, but you can use whatever language or scheduler you perfer.

I can't recommend this setup enough. The only weird things I've run into is a really complex join across multiple postgres tables and parquet files had a bug reading a postgres column type. I simplified the query (which was a good idea anyways) and it hums away

1 comments

Thanks. My current pattern is to parse the files with rust, copy from stdin into a psql temp table, update the rows that have changed and delete the rows not existing anymore.

I'm hoping it's less wasteful than truncating and importing the whole table every time there is one single change.

It probably is, but you can do that workflow with DuckDB too. It's just really flexible since it can all be done in SQL. Check out the DuckDB postgres connection (https://duckdb.org/docs/extensions/postgres.html)

To import from a Salesforce Bulk Export CSV (for example) into a postgres table is a few lines of SQL

``` INSERT INTO pgdb.accounts ( id, created_at, updated_at ) SELECT "Id", "CreatedDate"::TIMESTAMP, "UpdatedDate"::TIMESTAMP FROM read_csv('/path/to/file.csv') WHERE "Id" NOT IN (SELECT id FROM pgdb.accounts) ```

The path can be in a cloud storage provider as well, which is really nice. You can do updates which join from the postgres db or the other CSV files (or a MySQL database) as well. The data transforms (casting to timestamp, uuid, etc.) have been super handy along with all the other SQL niceties that you get