Hacker News
by sklarsa 1088 days ago
Personally, I use xsv and it’s been tremendously helpful, especially when working with larger files. https://github.com/BurntSushi/xsv
2 comments

xsv is great for quick sanity checks (e.g. number of columns, unique value counts in a given column), but for more serious tasks or giant files I switch to either polars or duckdb, converting CSV/TSV files to parquet or parquet datasets.

By giant I mean 25G gzipped files with >10^9 rows like these VCFs: https://ftp.ncbi.nlm.nih.gov/snp/latest_release/VCF/
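
For anyone curious what those quick sanity checks amount to, here is a rough stdlib-only Python sketch: one streaming pass that reports the column names, the row count, and the value counts for one column. The names `open_text` and `sanity_check` are made up for illustration, and this is not how xsv does it internally. It also assumes the first line is the header, so real VCFs with leading `##` metadata lines would need those skipped first.

```python
import csv
import gzip
from collections import Counter


def open_text(path):
    """Open a plain or gzip-compressed text file for reading (hypothetical helper)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", newline="", encoding="utf-8")
    return open(path, "r", newline="", encoding="utf-8")


def sanity_check(path, column, delimiter=","):
    """Stream the file once and return (header, row count, value counts for `column`)."""
    counts = Counter()
    rows = 0
    with open_text(path) as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        header = list(reader.fieldnames or [])
        if column not in header:
            raise KeyError(f"column {column!r} not found in header {header}")
        for row in reader:
            rows += 1
            counts[row[column]] += 1
    return header, rows, counts
```

Rows are never held in memory, only the counter is, so this copes with large files as long as the chosen column does not have too many distinct values. For a high-cardinality column on a 10^9-row file, that is where the columnar tools mentioned above start to pay off.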

Maybe try the sqlp [1] and to [2] commands from the qsv fork. From the README:

sqlp: Run blazing-fast Polars SQL queries against several CSVs - converting queries to fast LazyFrame expressions, processing larger than memory CSV files.

to: Convert CSV files to PostgreSQL, SQLite, XLSX, Parquet and Data Package.

[1] https://github.com/jqnatividad/qsv/blob/master/src/cmd/sqlp....

[2] https://github.com/jqnatividad/qsv/blob/master/src/cmd/to.rs...
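
To give a feel for the "convert to SQLite, then query with SQL" workflow those two commands cover, here is a minimal Python sketch using only the standard library. It is not qsv's implementation, and `load_csv` is a made-up helper. It assumes the first row is a header, that every row has the same number of fields, and that column names contain no double quotes. All columns are stored as TEXT.

```python
import csv
import sqlite3


def load_csv(conn, table, path):
    """Create `table` from the CSV header (all TEXT columns) and insert every row."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        columns = ", ".join(f'"{name}" TEXT' for name in header)
        conn.execute(f'CREATE TABLE "{table}" ({columns})')
        placeholders = ", ".join("?" for _ in header)
        # executemany consumes the reader lazily, so the file is streamed.
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()
```

After something like `load_csv(sqlite3.connect("data.db"), "variants", "variants.csv")` (file and table names are placeholders), an ordinary `SELECT ... GROUP BY ...` over the table works as usual.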

Yesterday I shared the qsv fork [1], which is more active.

xsv is leaner, while qsv tries to support every action you might want to perform on CSV files [2].

[1] https://github.com/jqnatividad/qsv

[2] https://github.com/jqnatividad/qsv/discussions/290#discussio...