Hacker News new | ask | show | jobs
by ryanwaldorf 746 days ago
Fair callouts. For the first version I was looking to solve the basic use case of "I've downloaded a standard CSV from one database and I want to ingest it into another". I'll look at adding the ability to handle non-standard CSVs as well.

Also I appreciate you pointing out what Snowflake and BigQuery have with this. I'm wondering if there's something Redshift-specific that may be useful here for larger file volumes. It is more performant to do a CSV copy, but it can be a PITA when it comes to having a role with the right permissions to copy data from the right location.

Anything else you'd recommend I look into?

1 comments

Yeah I'm not sure about redshift, but bigquery uses "autodetect" so something like

bq load --autodetect --source_format=CSV mydataset.mytable ./myfile.csv

And snowflake uses INFER_SCHEMA I believe you can do this

select * from table( infer_schema( location=>'@stage/my file.csv', file_format=>'my_csv_format' ) );

Although tbh I'm not sure if that's what you're looking for. You might enjoy looking at duckdb for stuff like this. My policy when starting data engineering was to bung everything into pandas dataframes, and now my policy is to try to avoid them at all costs because they're slow and memory hungry!

In ClickHouse it is just `INSERT INTO t FROM INFILE 'data.csv.gz'`. Any supported format, any encryption, autodetected from file name and sample data piece to get column types, delimeters etc. Separate tools to convert CSV are not necessary if you can just import to db and export as SQL Statements.

echo "name,age,city John,30,New York Jane,25,Los Angeles" > example.csv

clickhouse local -q "SELECT * FROM file('example.csv') FORMAT SQLInsert" INSERT INTO table (`name`, `age`, `city`) VALUES ('John', 30, 'New York'), ('Jane', 25, 'Los Angeles');