| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by bdcravens 757 days ago
	Perhaps ease in troubleshooting? I know dealing with line endings, mystery encodings, etc can make CSV ingestion tedious.

1 comments

adammarples 757 days ago

This, as far as I can tell, completely ignores all of that. You're supposed to export the data yourself as csv before beginning, so it's up to you what character encoding, delimiter, escape characters, etc. you will be using. The csv is fed directly into "pd.read_csv" with no option to pass any arguments along in case you have a slightly non-standard csv format. This is a good tool if it solves this guy's problems but as a data engineer I would suggest to him to learn the tooling a little more. He makes reference to snowflake and bigquery and I know they both allow you to upload a csv into a table and auto detect or infer the schema very easily. That would be a better way than using pandas and copy paste.

link

ryanwaldorf 757 days ago

Fair callouts. For the first version I was looking to solve the basic use case of "I've downloaded a standard CSV from one database and I want to ingest it into another". I'll look at adding the ability to handle non-standard CSVs as well.

Also I appreciate you pointing out what Snowflake and BigQuery have with this. I'm wondering if there's something Redshift-specific that may be useful here for larger file volumes. It is more performant to do a CSV copy, but it can be a PITA when it comes to having a role with the right permissions to copy data from the right location.

Anything else you'd recommend I look into?

link

adammarples 756 days ago

Yeah I'm not sure about redshift, but bigquery uses "autodetect" so something like

bq load --autodetect --source_format=CSV mydataset.mytable ./myfile.csv

And snowflake uses INFER_SCHEMA I believe you can do this

select * from table( infer_schema( location=>'@stage/my file.csv', file_format=>'my_csv_format' ) );

Although tbh I'm not sure if that's what you're looking for. You might enjoy looking at duckdb for stuff like this. My policy when starting data engineering was to bung everything into pandas dataframes, and now my policy is to try to avoid them at all costs because they're slow and memory hungry!

link

qoega 756 days ago

In ClickHouse it is just `INSERT INTO t FROM INFILE 'data.csv.gz'`. Any supported format, any encryption, autodetected from file name and sample data piece to get column types, delimeters etc. Separate tools to convert CSV are not necessary if you can just import to db and export as SQL Statements.

echo "name,age,city John,30,New York Jane,25,Los Angeles" > example.csv

clickhouse local -q "SELECT * FROM file('example.csv') FORMAT SQLInsert" INSERT INTO table (`name`, `age`, `city`) VALUES ('John', 30, 'New York'), ('Jane', 25, 'Los Angeles');

link