Hacker News
by lurker458 2220 days ago
I've also been looking for that. In an ideal world there would be a small, fast, standalone CLI tool that can convert CSV to Parquet. There is a (sadly, unfinished) Parquet writer Rust library in the Arrow repository that looks promising. All approaches I've tried so far (Spark, pyarrow, Drill, ...) require everything and the kitchen sink. So far I've settled on a Java CLI tool that uses Jackson + org.apache.parquet internally, but it's CPU-bound and has a huge number of Maven dependencies.

pandas + fastparquet is fairly lightweight. But yes, I would love to see a simple C++/Go binary that's just a single csv2parq call.
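Newer versions of pandas can write Parquet directly with to_parquet, so you don't have to call fastparquet yourself, though a Parquet engine (pyarrow or fastparquet) still needs to be installed. This code works: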

import pandas as pd

df = pd.read_csv('data/us_presidents.csv')

df.to_parquet('tmp/us_presidents.parquet')

Nice! Does that work alongside reading in via chunks and writing via row_groups? If I have a 500GB CSV will it work?