This might end up being the best way to etl postgres tables to parquet. From everything else that I tried, doing a copy to CSV and then converting to parquet was the fastest but can be a pain when dealing with type conversions.
I didn't know DuckDB could read parquet natively[0]. Thanks for mentioning this. I recently had the displeasure of trying to deal with parquet files in nodejs and it was a disaster. I might try going through DuckDB instead.
This won't let you ETL from PG to parquet, but I used this in anger the other day https://github.com/manojkarthick/pqrs Worked quite well for my purposes!
Ah. connector-x claims to be able to do a lot. Seems useful if it can actually deliver on that. Good to know that COPY with asyncpg is a quick approach.
In my case, I had parquet to begin with because I accidentally deleted some production data (oopsies) and when you export a snapshot from RDS to S3, it is in Parquet. Thankfully, I now have a few tricks up my sleeve to quickly restore data, but that was stressful for a bit haha
Exactly, I almost started writing a tool for that, because nothing else I found supports arrays and composite types. Now let's hope DuckDB does support that :)
If you don't use arrays and composites, Spark should be able to do it, right?
SELECT ... FROM postgresql(...) FORMAT Parquet
And you can run this query without installing ClickHouse, using the clickhouse-local command-line tool.
It can be downloaded simply as:
curl https://clickhouse.com/ | sh