Hacker News new | ask | show | jobs
by m_ke 1365 days ago
This might end up being the best way to etl postgres tables to parquet. From everything else that I tried, doing a copy to CSV and then converting to parquet was the fastest but can be a pain when dealing with type conversions.
4 comments

It is very easy with ClickHouse. All you need is:

SELECT ... FROM postgresql(...) FORMAT Parquet

And you can run this query without installing ClickHouse, using the clickhouse-local command-line tool.

It can be downloaded simply as:

curl https://clickhouse.com/ | sh

I didn't know DuckDB could read parquet natively[0]. Thanks for mentioning this. I recently had the displeasure of trying to deal with parquet files in nodejs and it was a disaster. I might try going through DuckDB instead.

[0]: https://duckdb.org/docs/data/parquet

This won't let you ETL from PG to parquet, but I used this in anger the other day https://github.com/manojkarthick/pqrs Worked quite well for my purposes!
I was trying https://github.com/sfu-db/connector-x and hacking around with this https://github.com/spitz-dan-l/postgres-binary-parser but it turned out that a COPY to csv using asyncpg and then converting to parquet was the fastest.
Ah. connector-x claims to be able to do a lot. Seems useful if it can actually deliver on that. Good to know that COPY with asyncpg is a quick approach.

In my case, I had parquet to begin with because I accidentally deleted some production data (oopsies) and when you export a snapshot from RDS to S3, it is in Parquet. Thankfully, I now have a few tricks up my sleeve to quickly restore data, but that was stressful for a bit haha

Exactly, I almost started writing a tool for that, because nothing else I found supports arrays and composite types. Now let's hope DuckDB does support that :)

If you don't use arrays and composites, Spark should be able to do it, right?

Yes, DuckDB supports writing lists and structs to Parquet! https://github.com/duckdb/duckdb/pull/2832

Does that help or do you have any other questions?

Awesome. However, the postgres_scanner has problems with jsonb, it won't even connect the database which contains those :/
Thank you for filing a bug, we will have a look at that I'm sure!
It's more of a feature request. Thank you :)