Hacker News new | ask | show | jobs
by brunoqc 637 days ago
I batch import XMLs, CSVs and mssql data into postgresql.

I'm pretty sure I could read them when needed with fdw. Is it a good idea?

I think it can be slow but maybe I could use materialized views or something.

3 comments

I've been using duckdb to import data into postgres (especially CSVs and JSON) and it has been really effective.

Duckdb can run SQL across the different data formats and insert or update directly into postgres. I run duckdb with python and Prefect for batch jobs, but you can use whatever language or scheduler you perfer.

I can't recommend this setup enough. The only weird things I've run into is a really complex join across multiple postgres tables and parquet files had a bug reading a postgres column type. I simplified the query (which was a good idea anyways) and it hums away

Thanks. My current pattern is to parse the files with rust, copy from stdin into a psql temp table, update the rows that have changed and delete the rows not existing anymore.

I'm hoping it's less wasteful than truncating and importing the whole table every time there is one single change.

It probably is, but you can do that workflow with DuckDB too. It's just really flexible since it can all be done in SQL. Check out the DuckDB postgres connection (https://duckdb.org/docs/extensions/postgres.html)

To import from a Salesforce Bulk Export CSV (for example) into a postgres table is a few lines of SQL

``` INSERT INTO pgdb.accounts ( id, created_at, updated_at ) SELECT "Id", "CreatedDate"::TIMESTAMP, "UpdatedDate"::TIMESTAMP FROM read_csv('/path/to/file.csv') WHERE "Id" NOT IN (SELECT id FROM pgdb.accounts) ```

The path can be in a cloud storage provider as well, which is really nice. You can do updates which join from the postgres db or the other CSV files (or a MySQL database) as well. The data transforms (casting to timestamp, uuid, etc.) have been super handy along with all the other SQL niceties that you get

“it depends”. Some considerations for mssql:

- If the foreign server is close (latency) that’s great

- if your query is complex then it helps if the postgres planner can “push down” to mssql. That will usually happen if you aren’t doing joins to local data

I personally like to set up the foreign tables, then materialize the data into a local postgres table using pg_cron. It’s like a basic ETL pipeline completely built into postgres

Oh. That is smart using it as a very simple ETL pipeline.
check out clickhouse. you might like it.