Hacker News new | ask | show | jobs
by mirimir 2733 days ago
It's not that CSV is hard to parse. It's that there's no guarantee that you'll get proper CSV. For example, it may literally be "CSV", without quotes. And that's fatal if values contain commas. I've even seen CSV with values that contain ``","``!
5 comments

RFC4180 [1] standardizes CSV, but there are many implementations that don't read this 100% and unfortunately even more (including an extremely popular spreadsheet application) that don't write it.

If you are including CSV functionality in something you work on, please read and follow this (tiny) spec!

[1] https://tools.ietf.org/html/rfc4180

Whether Excel writes standard CSV or not depends on the user's locale settings. E.g. with a German locale you get a semicolon (;) as a separator which you can only change system-wide. However, apart from the changed seperator, it's still standard CSV, which still works with Python or SQLite (.separator ; .import foo.csv foo).

A bigger issue is that Excel tends to write large numbers in scientific notation, which is a common issue handling price lists. E.g. it'll turn EAN numbers into 6.2134e+11, losing most of the number. Then you have to go back to the XLS file and change the column type into text and exporting it again as CSV. As this is lossy you can't fix it when receiving such a file.

Something like the SQL Server Import/Export Wizard but being able to write SQLite files would be very handy.

Fun fact - we noticed SQLite wasn't RFC-compliant for it's CSV output (it used native lineendings, not CRLF, which is mandatory).

It is fixed now... but I'm now wondering whether SQLite wasn't more correct in the first place...

The worst part about CSV is the locale-specific format used by some programs. For example when exporting to CSV a German Excel will use the comma as the decimal separator in numbers and a semicolon for the value separator. So you need to specify the exact format for each individual file you want to import.
Back when I worked at an NLP company, my boss used to say that he was "morally opposed to CSV".

When sharing tabular data in text format, he always preferred TSV because commas were everywhere in the material we were working with, but tab characters were really rare.

I've always been curious about the characters in ASCII for this, but I've never seen them used in the wild. Stuff like "Group Separator" (0x1D), "Record Separator" (0x1E) or "Unit Separator" (0x1F)

Is there a reason why nobody uses these? Did someone work out back in the 90s they were pure evil and we've just never used them since?

https://www.lammertbies.nl/comm/info/ascii-characters.html

Those codes were originally for just such a purpose, but as others point out there was a bootstrapping problem. At this point, if Excel doesn’t support it, it’s not going to gain traction.

About 10 years ago I had to use those for a financial system integration. I was getting files that had been created on a mainframe, and whoever wrote it originally had the foresight to use those characters. Probably because they were based out of EMEA and understood that commas weren’t useful across national borders.
The codes between x01 and x1F were designed for telecommunications, for instance binary custom formats over RS-232 or synchronous protocols.

There is an excellent description in "C Programmer's Guide to Serial Communications" by Joe Campbell.

There’s no physical key for them. You can trivially write/edit CSV or TSV on any computer using any editor.
Indeed. I used to clean up data in UltraEdit. With the option for direct disk read/write set, there was virtually no file-size limit.
When I was working as a forensic data analyst, I felt pretty much the same. If you request CSV in discovery, there's no telling what you'll get. I mean, the data may come from custom COBOL, and then get reviewed by someone using Excel.

So yes, TSV. However, I've seen TSV with spurious tabs :(

Sometimes I ended up pushing for |-delimited data. Or even fixed-width format :)

> I've even seen CSV with values that contain ``","``!

Are you sure it wasn't an injection attempt of some kind?

It could have been, I suppose.

But more likely is twisted creativity. It seems that some businesses are still using ancient systems, based on COBOL, AS/400, etc. There's resistance to changing legacy code. So when business changes require additional data fields, fields sometimes get subdivided. So a field that originally contained stuff like |foo| now contains stuff like |"foo","bar,baz"| or whatever. That works, because there's nothing like CSV in the data system. But when someone tries a CSV export, you get garbage.

Newlines in the middle of a quoted field will cause problems for a lot of tools. And Python's csv.DictReader gives an error when a delimited file has too few fields.

But one can usually work around this.