HN Mirror


by akbo 2485 days ago

These tools probably do a good job at processing CSV/TSV/DSV (haven't tried them). However, I would love it if we could just stop using delimiter-separated value files altogether.

Why? Because the file format is an underspecified, unsafe mess. When you get this kind of file, you have to specify its schema manually when reading it, because the file doesn't contain it. Also, because the format is underspecified, many unsafe implementations produce broken files that cannot be read without manual fixing in a text editor. Let's just start using safe, well-specified file formats like Avro, Parquet or ORC.
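To make the missing-schema point concrete, here is a minimal Python sketch using only the standard library. The column names and the `schema` mapping are made up for illustration. It shows that a CSV round trip returns every value as a string, so the reader has to supply the types by hand.

```python
import csv
import io

# Hypothetical typed records; the column names are illustrative only.
records = [{"id": 1, "price": 9.5, "active": True}]

# Write the records to an in-memory CSV file.
buf = io.StringIO(newline="")
writer = csv.DictWriter(buf, fieldnames=["id", "price", "active"])
writer.writeheader()
writer.writerows(records)

# Read them back: the file carries no type information, so every value is a string.
buf.seek(0)
raw = list(csv.DictReader(buf))

# The schema has to be specified manually by whoever reads the file.
schema = {"id": int, "price": float, "active": lambda s: s == "True"}
typed = [{key: schema[key](value) for key, value in row.items()} for row in raw]
```

Formats such as Avro or Parquet store the schema alongside the data, so this manual step is not needed there.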

As a data scientist, I have had lots of issues because the data I got for a project was a CSV/TSV/DSV file. I recently spat out a rant on this topic, so if you want more details, check out https://haveagooddata.net/posts/why-you-dont-want-to-use-csv...

2 comments

burntsushi 2485 days ago

This is addressed in the xsv readme: https://github.com/BurntSushi/xsv#motivation

We can't stop, because it's a de facto standard format for exchange with spreadsheet programs. So long as that's ubiquitous, we might as well write tools to make processing them easier.

Also, I'm not sure why you called CSV unsafe. It's certainly the case that it's severely under-specified, but I don't think there's anything unsafe about it.


akbo 2485 days ago

> Also, I'm not sure why you called CSV unsafe.

One example of it being unsafe that happened to me: I got a CSV file written by a program with a broken CSV writer that didn't quote string fields containing a newline (in my case only the first half of a newline: a carriage return). Then I read the file with a broken CSV reader that assumed the carriage return meant a new record and filled both parts of the broken line with N/As instead of throwing an error. As a result, the data in the sink didn't match the data in the source. This is a loss of data integrity, which I would call unsafe. It doesn't happen if you have a file format that serializes your data safely.
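A rough Python sketch of that failure mode, under assumptions: the "naive" writer and reader below are hypothetical stand-ins for the broken implementations (not the actual programs involved), and the sample rows are made up. For contrast, the standard-library `csv` module quotes the field and round-trips it intact.

```python
import csv
import io

# Made-up sample data: one field contains a bare carriage return.
rows = [["id", "comment"], ["1", "first half\rsecond half"], ["2", "ok"]]

# Hypothetical broken writer: joins fields with commas and never quotes.
naive_text = "".join(",".join(row) + "\n" for row in rows)

# Hypothetical broken reader: treats any line break, including a bare
# carriage return, as a record boundary.
naive_records = [line.split(",") for line in naive_text.splitlines()]

# Standard-library writer: quotes fields that contain line-break characters.
buf = io.StringIO(newline="")
csv.writer(buf).writerows(rows)
safe_text = buf.getvalue()

# Standard-library reader: keeps the quoted carriage return inside the field.
safe_records = list(csv.reader(io.StringIO(safe_text, newline="")))
```

With the naive pair, three source rows turn into four records and nothing raises an error. With the `csv` module on both ends, `safe_records` equals `rows`.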

Due to the format being underspecified, many people roll their own unsafe CSV writer or CSV reader, thus every CSV file (where you don't completely control the source) is potentially broken.

Edit: Browsing your GitHub account, I found that you implemented a CSV parser in Rust. I didn't know that when I wrote the above comment, so I was definitely not trying to imply that your particular CSV parser is unsafe.


heavenlyblue 2484 days ago

What makes you think that if people manage to misimplement CSV parsers and generators they are not going to misimplement other formats? At least with CSV it’s always easy to implement some sort of heuristic that splits the rows correctly.

The only times when I had to deal with the issues you describe I was supplied with the data from a literally dying company. They just didn’t give a damn. Changing the file formats wouldn’t change anything - they would still find a way to mess it up.


burntsushi 2482 days ago

Ah I see. Yeah, where I come from, "unsafe" has a bit more weight to it. I'd call what you describe "silently incorrect." Which is also quite bad, to be fair!


mhd 2485 days ago

> These tools probably do a good job at processing CSV/TSV/DSV (haven't tried them). However, I would love it if we could just stop using delimiter-separated value files altogether.

I hear ya. I have no doubt that "we" could, as in IT professionals. Maybe even the "surrounding" science fields that provide data could make an effort. But you're out of luck when it comes to almost any other field that serves you data, in my experience.

Any org that can't even be told the details of the CSV you want (what's the "C"? UTF-8? Quoting?) will have no chance of providing you with something more complex. It's partly our fault: the tools we provide them with suck. Excel's CSV handling is atrocious. Salesforce and similar tools seem to spit out barely consistent data dumps.
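Those three details (the delimiter, the encoding, the quoting) are exactly what a reader has to pin down before parsing. A small Python sketch, assuming a semicolon-delimited, UTF-8 encoded export with double-quote quoting; the sample bytes are invented for illustration:

```python
import csv
import io

# Invented sample export: semicolon-delimited, UTF-8, double-quote quoting.
raw_bytes = 'name;city\n"Müller; Anna";Köln\n'.encode("utf-8")

# Encoding, delimiter and quote character are stated explicitly, not guessed.
text = io.TextIOWrapper(io.BytesIO(raw_bytes), encoding="utf-8", newline="")
rows = list(csv.reader(text, delimiter=";", quotechar='"'))
```

If any of the three assumptions is wrong for a given file, the parse either fails or, worse, succeeds with wrong data.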

Sometimes I feel like 80% of the industry is dealing with sanitizing input.
