Hacker News
by nmz 1739 days ago
CSV is a complicated format but that does not mean awk is incapable of dealing with it.

https://www.gnu.org/software/gawk/manual/html_node/Splitting...

https://github.com/e36freak/awk-libs/blob/master/csv.awk

https://raw.githubusercontent.com/Nomarian/Awk-Batteries/mas...


> CSV is a complicated format

Surprisingly and unnecessarily so:

> ["DSV"] is to Unix what CSV (comma-separated value) format is under Microsoft Windows and elsewhere outside the Unix world. CSV (fields separated by commas, double quotes used to escape commas, no continuation lines) is rarely found under Unix.

> In fact, the Microsoft version of CSV is a textbook example of how not to design a textual file format. Its problems begin with the case in which the separator character (in this case, a comma) is found inside a field. The Unix way would be to simply escape the separator with a backslash, and have a double escape represent a literal backslash. This design gives us a single special case (the escape character) to check for when parsing the file, and only a single action when the escape is found (treat the following character as a literal). The latter conveniently not only handles the separator character, but gives us a way to handle the escape character and newlines for free. CSV, on the other hand, encloses the entire field in double quotes if it contains the separator. If the field contains double quotes, it must also be enclosed in double quotes, and the individual double quotes in the field must themselves be repeated twice to indicate that they don't end the field.

> The bad results of proliferating special cases are twofold. First, the complexity of the parser (and its vulnerability to bugs) is increased. Second, because the format rules are complex and underspecified, different implementations diverge in their handling of edge cases. Sometimes continuation lines are supported, by starting the last field of the line with an unterminated double quote — but only in some products! Microsoft has incompatible versions of CSV files between its own applications, and in some cases between different versions of the same application (Excel being the obvious example here).

The Art of Unix Programming http://www.catb.org/~esr/writings/taoup/html/ch05s02.html
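
For what it's worth, the "single special case" parser the quote describes fits in a few lines. A minimal sketch in Python, assuming a comma separator and that the whole record is already in one string; `split_escaped` is a made-up name for illustration, not something from the book:

```python
def split_escaped(record, sep=","):
    """Split one backslash-escaped record into fields (illustrative sketch)."""
    fields, current = [], []
    chars = iter(record)
    for ch in chars:
        if ch == "\\":
            # The only special case: take the next character literally,
            # whether it is the separator, a backslash, or a newline.
            current.append(next(chars, ""))
        elif ch == sep:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


print(split_escaped(r"a\,b,c\\d,e"))
```

There is no quoting state to track, which is the point the quote is making.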

> The latter conveniently not only handles the separator character, but gives us a way to handle the escape character and newlines for free. CSV, on the other hand, encloses the entire field in double quotes if it contains the separator. If the field contains double quotes, it must also be enclosed in double quotes, and the individual double quotes in the field must themselves be repeated twice to indicate that they don't end the field.

I KNOW how CSV works, for the most part, and my brain still started tuning out and stopped building a mental model partway through that description.

The quoting also helps preserve embedded non-printable characters, newlines, etc. (and yes, those do appear in real data).
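
To make the quoting rules concrete, here is a small round trip through Python's standard `csv` module. This is only a sketch of one implementation's behavior (the default dialect with minimal quoting); the sample row is made up, and other CSV producers may well differ, which is the complaint in the quote.

```python
import csv
import io

# One field with a comma and double quotes, one with an embedded newline.
row = ["plain", 'say "hi", ok', "line one\nline two"]

buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(row)
encoded = buf.getvalue()
print(encoded)

# newline="" leaves line endings alone so the reader sees the embedded newline.
decoded = next(csv.reader(io.StringIO(encoded, newline="")))
print(decoded == row)
```

The quoted fields come back intact, but the reader has to carry "inside quotes" state across physical lines to do it.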

One extension of the "Unix version" would be to impose a requirement like the one in JSON, where all non-printable and/or non-ASCII characters must be written as an escape sequence like "\uXXXX".
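
A rough sketch of what the writer side of that could look like in Python. Everything here is an assumption for illustration: `escape_field` is a hypothetical name, the separator is a comma, and characters outside the Basic Multilingual Plane are written as UTF-16 surrogate pairs the way JSON does it. A matching reader would need to treat `\u` followed by four hex digits as one extra case.

```python
def escape_field(text, sep=","):
    """Escape one field so the output is printable ASCII only (sketch)."""
    out = []
    for ch in text:
        if ch == sep or ch == "\\":
            # Separator and backslash get the plain backslash escape.
            out.append("\\" + ch)
        elif ch.isascii() and ch.isprintable():
            out.append(ch)
        else:
            # Everything else becomes one or two \uXXXX units (UTF-16).
            units = ch.encode("utf-16-be", "surrogatepass")
            for i in range(0, len(units), 2):
                out.append("\\u%04x" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)


print(escape_field("caf\u00e9, line one\nline two"))
```

Since a literal `u` is never escaped on its own, `\u` in the output is unambiguous, and a record always fits on one physical line.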

This is why I hate CSV files. Trying to reformat huge blocks of data is a job that Awk does well. The associative arrays let you build structures that let you do the heavy lifting. For record processing, Awk should be one of the first tools you look at.