Hacker News new | ask | show | jobs
by ahalbert 1087 days ago
I love using Awk, the only thing I miss is that it can't handle complex csv files. Does anyone know how to handle quoted CSV strings like

> "foo","bar,baz"

6 comments

I like the idea of Unix pipelines, but I hate all the sublanguages, awk being one of the biggest. I scratched my itch and built my own shell, marcel: https://github.com/geophile/marcel.

I mention this specifically, here, because of the CSV point. Marcel handles CSV, e.g. "read --csv foobar.csv" reads the foobar.csv file, parses the input (getting quotes and commas correct), and yields a stream of Python tuples, splitting each line of the CSV into the elements of the output tuples.

Marcel also supports JSON input, translating JSON structures into Python equivalents. (The "What's New" section of marcel's README has more information on JSON support, which was just added.)

If quoted string is the only thing you need to handle extra (i.e. no escaped quotes, newlines, etc) and if you have GNU awk:

    $ echo '"foo","bar,baz"' | awk -v FPAT='"[^"]*"|[^,]*' '{print $1}'
    "foo"
    $ echo '"foo","bar,baz"' | awk -v FPAT='"[^"]*"|[^,]*' '{print $2}'
    "bar,baz"
For a more robust solution, see https://stackoverflow.com/q/45420535 or use other tools like https://github.com/BurntSushi/xsv
I wanted to ask why not the more simple form:

echo '"foo","bar,baz","boo"' | awk -F"\",\"" '{print $1}' "foo

echo '"foo","bar,baz","boo"' | awk -F"\",\"" '{print $2}' bar,baz

echo '"foo","bar,baz","boo"' | awk -F"\",\"" '{print $3}' boo"

Realizing that I have to strip the quotes that remain.

Edit. formatting.

EDit, again, from your link, the following is more terse and too my taste (still needs strips):

awk -v FPAT='("[^"]*")+'

I usually use this awk function to parse CSV in awk:

    # This function takes a line i.e. $0, and treats it as a line of CSV, breakin
    # it into individual fields, and storing them in the passed in field array. It
    # returns the number of fields found, 0 if none found. It takes account of CSV
    # quoting, and also commas within CSV quoted fields, but doesn't remove them
    # from the parsed field.
    # use in code like:
    #   number_of_fields = parse_csv_line($0, csv_fields)
    #   csv_fields[2]  # get second parsed field in $0
    function parse_csv_line(line, field,   _field_count) {
      _field_count = 0
      # Treat each line as a CSV line and break it up into individual fields
      while (match(line, /(\"([^\"]|\"\")+\")|([^,\"\n]+)/)) {
        field[++_field_count] = substr(line, RSTART, RLENGTH)
        line = substr(line, RSTART+RLENGTH+1, length(line))
      }
      return _field_count
    }
It's not perfect but gets the job done most of the time and works across all awk implementations.
Convert it with Miller first:

    mlr --icsv --otsv cat examplefile
* https://miller.readthedocs.io/en/latest/10min/
Yes, this is what csvquote does. It does nothing else, just this so that programs like awk, sed, cut, etc. can work properly.

https://github.com/dbro/csvquote

They are planning built-in support for that, see that other comment https://news.ycombinator.com/item?id=36518146