| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by olalonde 2733 days ago
	Reminds me of this HN submission: https://news.ycombinator.com/item?id=16809963 Apparently CSV is actually quite hard to parse.

3 comments

mirimir 2733 days ago

It's not that CSV is hard to parse. It's that there's no guarantee that you'll get proper CSV. For example, it may literally be "CSV", without quotes. And that's fatal if values contain commas. I've even seen CSV with values that contain ``","``!

link

gregmac 2733 days ago

RFC4180 [1] standardizes CSV, but there are many implementations that don't read this 100% and unfortunately even more (including an extremely popular spreadsheet application) that don't write it.

If you are including CSV functionality in something you work on, please read and follow this (tiny) spec!

[1] https://tools.ietf.org/html/rfc4180

link

blattimwind 2733 days ago

Whether Excel writes standard CSV or not depends on the user's locale settings. E.g. with a German locale you get a semicolon (;) as a separator which you can only change system-wide. However, apart from the changed seperator, it's still standard CSV, which still works with Python or SQLite (.separator ; .import foo.csv foo).

A bigger issue is that Excel tends to write large numbers in scientific notation, which is a common issue handling price lists. E.g. it'll turn EAN numbers into 6.2134e+11, losing most of the number. Then you have to go back to the XLS file and change the column type into text and exporting it again as CSV. As this is lossy you can't fix it when receiving such a file.

Something like the SQL Server Import/Export Wizard but being able to write SQLite files would be very handy.

link

sonofgod 2733 days ago

Fun fact - we noticed SQLite wasn't RFC-compliant for it's CSV output (it used native lineendings, not CRLF, which is mandatory).

It is fixed now... but I'm now wondering whether SQLite wasn't more correct in the first place...

link

skrause 2733 days ago

The worst part about CSV is the locale-specific format used by some programs. For example when exporting to CSV a German Excel will use the comma as the decimal separator in numbers and a semicolon for the value separator. So you need to specify the exact format for each individual file you want to import.

link

amyjess 2733 days ago

Back when I worked at an NLP company, my boss used to say that he was "morally opposed to CSV".

When sharing tabular data in text format, he always preferred TSV because commas were everywhere in the material we were working with, but tab characters were really rare.

link

shabbyrobe 2733 days ago

I've always been curious about the characters in ASCII for this, but I've never seen them used in the wild. Stuff like "Group Separator" (0x1D), "Record Separator" (0x1E) or "Unit Separator" (0x1F)

Is there a reason why nobody uses these? Did someone work out back in the 90s they were pure evil and we've just never used them since?

link

mikestew 2733 days ago

https://www.lammertbies.nl/comm/info/ascii-characters.html

Those codes were originally for just such a purpose, but as others point out there was a bootstrapping problem. At this point, if Excel doesn’t support it, it’s not going to gain traction.

link

coredog64 2733 days ago

About 10 years ago I had to use those for a financial system integration. I was getting files that had been created on a mainframe, and whoever wrote it originally had the foresight to use those characters. Probably because they were based out of EMEA and understood that commas weren’t useful across national borders.

link

GnarfGnarf 2733 days ago

The codes between x01 and x1F were designed for telecommunications, for instance binary custom formats over RS-232 or synchronous protocols.

There is an excellent description in "C Programmer's Guide to Serial Communications" by Joe Campbell.

link

euos 2733 days ago

There’s no physical key for them. You can trivially write/edit CSV or TSV on any computer using any editor.

link

mirimir 2733 days ago

Indeed. I used to clean up data in UltraEdit. With the option for direct disk read/write set, there was virtually no file-size limit.

link

mirimir 2733 days ago

When I was working as a forensic data analyst, I felt pretty much the same. If you request CSV in discovery, there's no telling what you'll get. I mean, the data may come from custom COBOL, and then get reviewed by someone using Excel.

So yes, TSV. However, I've seen TSV with spurious tabs :(

Sometimes I ended up pushing for |-delimited data. Or even fixed-width format :)

link

kijin 2733 days ago

> I've even seen CSV with values that contain ``","``!

Are you sure it wasn't an injection attempt of some kind?

link

mirimir 2733 days ago

It could have been, I suppose.

But more likely is twisted creativity. It seems that some businesses are still using ancient systems, based on COBOL, AS/400, etc. There's resistance to changing legacy code. So when business changes require additional data fields, fields sometimes get subdivided. So a field that originally contained stuff like |foo| now contains stuff like |"foo","bar,baz"| or whatever. That works, because there's nothing like CSV in the data system. But when someone tries a CSV export, you get garbage.

link

cafard 2732 days ago

Newlines in the middle of a quoted field will cause problems for a lot of tools. And Python's csv.DictReader gives an error when a delimited file has too few fields.

But one can usually work around this.

link

BurningFrog 2733 days ago

It's hard enough that you should always use the library that handles the 5 weird cases, of which you'll only think of 3.

I made the mistake once and learned.

link

eecc 2733 days ago

In several companies, the interview assignment is indeed to implement a CSV parser.

link