	by ahalbert 1087 days ago
	I love using Awk; the only thing I miss is that it can't handle complex CSV files. Does anyone know how to handle quoted CSV strings like "foo","bar,baz"?

6 comments

geophile 1087 days ago

I like the idea of Unix pipelines, but I hate all the sublanguages, awk being one of the biggest. I scratched my itch and built my own shell, marcel: https://github.com/geophile/marcel.

I mention this specifically, here, because of the CSV point. Marcel handles CSV, e.g. "read --csv foobar.csv" reads the foobar.csv file, parses the input (getting quotes and commas correct), and yields a stream of Python tuples, splitting each line of the CSV into the elements of the output tuples.

Marcel also supports JSON input, translating JSON structures into Python equivalents. (The "What's New" section of marcel's README has more information on JSON support, which was just added.)


asicsp 1087 days ago

If quoted fields are the only extra thing you need to handle (i.e. no escaped quotes, embedded newlines, etc.) and you have GNU awk:

    $ echo '"foo","bar,baz"' | awk -v FPAT='"[^"]*"|[^,]*' '{print $1}'
    "foo"
    $ echo '"foo","bar,baz"' | awk -v FPAT='"[^"]*"|[^,]*' '{print $2}'
    "bar,baz"

For a more robust solution, see https://stackoverflow.com/q/45420535 or use other tools like https://github.com/BurntSushi/xsv


poetaster 1086 days ago

I wanted to ask: why not the simpler form?

    $ echo '"foo","bar,baz","boo"' | awk -F"\",\"" '{print $1}'
    "foo
    $ echo '"foo","bar,baz","boo"' | awk -F"\",\"" '{print $2}'
    bar,baz
    $ echo '"foo","bar,baz","boo"' | awk -F"\",\"" '{print $3}'
    boo"

I realize that I still have to strip the quotes that remain.

Edit: formatting.

Edit again: from your link, the following is terser and more to my taste (it still needs the quotes stripped):

    awk -v FPAT='("[^"]*")+'
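For the -F'","' form above, the leftover quotes can be stripped inside the same awk program. This is only a sketch under the same assumptions as the one-liners: every field is quoted, and no field contains an escaped quote or a newline. The function name quoted_csv_to_tsv is made up for illustration, and tab is an arbitrary choice of output separator.

```shell
# Hypothetical helper: read fully quoted CSV on stdin, write tab-separated
# fields on stdout. Uses only POSIX awk features (no FPAT).
quoted_csv_to_tsv() {
  awk -F'","' -v OFS='\t' '{
    sub(/^"/, "", $1)    # drop the leading quote left on the first field
    sub(/"$/, "", $NF)   # drop the trailing quote left on the last field
    print                # modifying a field rebuilds $0 using OFS
  }'
}

echo '"foo","bar,baz","boo"' | quoted_csv_to_tsv
```

After this, a second awk with -F'\t' sees bar,baz as a single field, provided no field contains a tab itself.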


wmwragg 1086 days ago

I usually use this awk function to parse CSV in awk:

    # This function takes a line, i.e. $0, treats it as a line of CSV, breaking
    # it into individual fields, and stores them in the passed-in field array. It
    # returns the number of fields found, 0 if none found. It takes account of CSV
    # quoting, and also commas within CSV quoted fields, but doesn't remove the
    # quotes from the parsed field.
    # use in code like:
    #   number_of_fields = parse_csv_line($0, csv_fields)
    #   csv_fields[2]  # get second parsed field in $0
    function parse_csv_line(line, field,   _field_count) {
      _field_count = 0
      # Treat each line as a CSV line and break it up into individual fields
      while (match(line, /(\"([^\"]|\"\")+\")|([^,\"\n]+)/)) {
        field[++_field_count] = substr(line, RSTART, RLENGTH)
        line = substr(line, RSTART+RLENGTH+1, length(line))
      }
      return _field_count
    }

It's not perfect but gets the job done most of the time and works across all awk implementations.
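Since the function leaves the surrounding quotes in place, a caller usually wants a second step that removes them and collapses doubled quotes. The sketch below wraps the function in a shell helper to show one way of doing that. The names csv_field and unquote are hypothetical, and the regex is written with plain " instead of \", which should be equivalent inside an awk regex literal. As far as I can tell, empty fields (a,,b) are skipped by the regex, so field numbers shift on such lines.

```shell
# Hypothetical wrapper: print field number $1 of each CSV line on stdin.
csv_field() {
  awk -v n="$1" '
    # Same logic as parse_csv_line above.
    function parse_csv_line(line, field,   _field_count) {
      _field_count = 0
      while (match(line, /("([^"]|"")+")|([^,"\n]+)/)) {
        field[++_field_count] = substr(line, RSTART, RLENGTH)
        line = substr(line, RSTART+RLENGTH+1, length(line))
      }
      return _field_count
    }
    # Hypothetical helper: strip the outer quotes and turn "" into ".
    function unquote(s) {
      if (s ~ /^".*"$/) {
        s = substr(s, 2, length(s) - 2)
        gsub(/""/, "\"", s)
      }
      return s
    }
    {
      count = parse_csv_line($0, f)
      if (n <= count) print unquote(f[n])   # ignore stale entries past count
    }
  '
}

echo '"foo","bar,baz"' | csv_field 2
```

Keeping the unquoting separate from the parsing means the original function stays unchanged and the caller decides whether the quotes matter.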


JdeBP 1086 days ago

Convert it with Miller first:

    mlr --icsv --otsv cat examplefile

* https://miller.readthedocs.io/en/latest/10min/


dbro 1086 days ago

Yes, this is what csvquote does. It does just this and nothing else, so that programs like awk, sed, cut, etc. can work properly.

https://github.com/dbro/csvquote


lysium 1087 days ago

They are planning built-in support for that; see this other comment: https://news.ycombinator.com/item?id=36518146
