by rkaveland 822 days ago
Author here. I see now that the title is too controversial, I should have toned that down. As I mention in the conclusion, if you're giving Parquet files to your users and all they want to know is how to turn them into Excel/CSV, you should just give them Excel/CSV. It is, after all, what end users often want. I'm going to edit the intro to make the same point there.

If you're exporting files for machine consumption, please consider using something more robust than CSV.

4 comments

Well what would be a more accurate title? "CSV format should only be for external interchange or archival; columnar formats like Parquet or Arrow better for performance"?

People are busy; instead of hinting "something more robust than CSV", mention the alternatives and show a comparison (load time/search time/compression ratio) summary graph. (Where is the knee of the curve?)

There's also an implicit assumption to each use-case about whether the data can/should fit in memory or not, and how much RAM a typical machine would have.

As you mention, it's pretty standard to store and access compressed CSV files as .csv.zip or .csv.gz, which at least mitigates the space issue, at the cost of a performance overhead when extracting or searching.
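
To make the trade-off concrete, here is a minimal sketch using only Python's standard library. The file name, column names, and row contents are made up for illustration. The file stays small on disk, but every search has to decompress and scan it from the start.

```python
import csv
import gzip
import os
import tempfile


def write_csv_gz(path, header, rows):
    # Text mode with newline="" lets the csv module control line endings.
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def count_matches(path, column, value):
    # The file is decompressed incrementally while it is read, so it never
    # has to fit in memory, but each search pays for a full scan.
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        return sum(1 for row in csv.DictReader(f) if row[column] == value)


# Illustrative data only: 10,000 rows with an alternating country code.
gz_path = os.path.join(tempfile.mkdtemp(), "export.csv.gz")
rows = [[i, "NO" if i % 2 else "SE"] for i in range(10_000)]
write_csv_gz(gz_path, ["id", "country"], rows)

matches = count_matches(gz_path, "country", "NO")
print(matches, "matching rows;", os.path.getsize(gz_path), "bytes on disk")
```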

The historical reason a standard like CSV became so entrenched in the business, financial and legal sectors is the same as elsewhere in enterprise computing: it's not that users are ignorant; it's vendor and OS lock-in. Is there any tool/package that dynamically switches between formats internally? Or one that estimates comparative file sizes before writing ("I see you're trying to write a 50GB XLSX file..."), or estimates read time when opening a file? Those sorts of things seem worth mentioning.

> Well what would be a more accurate title? "CSV format should only be for external interchange or archival; columnar formats like Parquet or Arrow better for performance"?

Something more boring, like "Consider whether other options make more sense for your data exports than CSV". Plenty of people have suggested other good options in comments on this submission, such as SQLite. I think the post comes off as if I'm trying to sell a particular file format for all use cases, when what I had in mind when writing it was to discourage using CSV as a default. CSV has a place, certainly, but it offloads a lot of complexity onto the people who are going to consume the data. In particular, they need to figure out how to interpret it. This can't necessarily be done by opening the file in an editor and looking at it; beyond a certain size you're going to need programming or great tools to inspect it anyway.
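
As a rough illustration of the SQLite option, here is a minimal sketch using Python's built-in sqlite3 module. The table name, columns, and rows are hypothetical. The point is that the declared column types travel inside the file, so the consumer can ask for them instead of guessing.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "export.db")

# Producer side: declare the column types once, in the export itself.
con = sqlite3.connect(db_path)
con.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, zip TEXT NOT NULL, signup_date TEXT)"
)
con.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "00501", "2021-03-01"), (2, "90210", "2021-03-02")],
)
con.commit()
con.close()

# Consumer side: read the schema from the file rather than inferring it.
con = sqlite3.connect(db_path)
# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk).
schema = [(row[1], row[2]) for row in con.execute("PRAGMA table_info(customers)")]
zip_value = con.execute("SELECT zip FROM customers WHERE id = 1").fetchone()[0]
con.close()

print(schema)     # [('id', 'INTEGER'), ('zip', 'TEXT'), ('signup_date', 'TEXT')]
print(zip_value)  # 00501
```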

I was given an initial export of ~100 poor-quality CSV files totaling around 3TB (~5-6 different tables, ~50 columns in each) a few years back, and had to automate ingestion of those and future exports. We could've saved a lot of work if the source had been able to export data in a friendlier format. It happened more than once during that project that we were sent CSVs or Excel sheets that had mangled data, such as zip codes or phone numbers with leading 0s removed. I think it is a good thing to inform people of these problems and encourage the use of formats that don't necessitate guessing data types. :shrug:
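
The leading-zero problem is easy to reproduce. In this minimal sketch, `guess` is a hypothetical stand-in for the type inference that many CSV importers and spreadsheet tools apply, and the sample rows are invented. The CSV text itself is intact; the damage happens when a reader decides a column "looks numeric".

```python
import csv
import io

# Invented sample data: zip codes and phone numbers with leading zeros.
raw = "name,zip,phone\nAda,00501,0047123456\nBob,90210,5551234\n"


def guess(value):
    # Hypothetical type inference: anything that parses as an integer
    # becomes one, which silently drops leading zeros.
    try:
        return int(value)
    except ValueError:
        return value


# Reading everything as text preserves the values exactly.
as_text = list(csv.DictReader(io.StringIO(raw)))

# Applying the guess per cell mangles identifiers that only look numeric.
guessed = [{key: guess(val) for key, val in row.items()} for row in as_text]

print(as_text[0]["zip"], as_text[0]["phone"])  # 00501 0047123456
print(guessed[0]["zip"], guessed[0]["phone"])  # 501 47123456
```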

> People are busy; instead of hinting "something more robust than CSV", mention the alternatives and show a comparison (load time/search time/compression ratio) summary graph. (Where is the knee of the curve?)

This might be an interesting thing to follow up later, but would require a lot more work.

> I see now that the title is too controversial, I should have toned that down.

Sometimes a click-baity title is what you need to get a decent conversation/debate going. Considering how many comments this thread got, I'd say you achieved that even if sparking a lengthy HN thread had never been your intent.

Congratulations on getting the article upvoted, and don't be too hard on yourself.

I got a parquet file once and I was like WTF is this format?

The problem with Parquet is that it's complicated, and you basically have to remap from Parquet to whatever you're importing into, because the people on the other side have remapped from whatever they had to Parquet.

There are likely relationships and constraints there that you'll have to hack around, which is harder to do because the Parquet tools sort of suck/aren't as flexible.

With CSV you can hack around any problem in the ETL process.