by aasasd 1366 days ago
Any experienced programmer learns to not use string processing on structured data, because that will bite them in the ass.

Meanwhile HN luddites: let me use awk, cut and whatnot despite the existence of a util that explicitly sidesteps this issue.


/me runs the example on bigbash.it, cleaned up a bit:

    (
      trap "kill 0" SIGINT;
      export LC_ALL=C;
      find movies.dat.gz -print0 |
        xargs -0 -I {} sh -c "gzip -dc {} | tail -n +2" |
        sed "s/::/;/g" |
        cut -d ';' -f2 |
        sort -t ';' -k 1,1 |
        head -n10 |
        awk -F ';' '{print $1}'
    )
Yeah, how about no. That's a very neat site and a clever hack, but there are clear escaping flaws in there for valid movie names.

Bash and the standard Unix tools make terrible structured-data manipulators. It's part of why `jq` is so widely used and loved, despite being kinda slow and hard to remember at times: it does things correctly, unlike most glued-together tools.

Yep, pretty sure that this script doesn't handle quoted strings in any way, and would promptly mangle a title that contains a semicolon.
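
To make the mangling concrete, here is a minimal sketch that pushes one made-up record through the same `sed`/`cut` steps. The record, its title, and the function names are assumptions for illustration; only the `id::title::genres` layout is inferred from the pipeline above.

```shell
# Hypothetical record; the title is made up and contains a semicolon.
record='1::Foo; Bar (1999)::Drama'

# Same sed + cut steps as the pipeline above (function name is illustrative).
title_via_cut() {
  sed 's/::/;/g' | cut -d ';' -f2
}

# Alternative: split on the original two-character separator instead.
title_via_awk() {
  awk -F '::' '{print $2}'
}

printf '%s\n' "$record" | title_via_cut   # prints: Foo
printf '%s\n' "$record" | title_via_awk   # prints: Foo; Bar (1999)
```

Splitting on `::` only moves the problem: the format has no quoting, so a title containing `::` would presumably break in the same way.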

"structured data" usually means there are delimiting characters, states, etc. AWK can certainly handle this. It's a simple and powerful language.

I don't think I've ever used it to parse JSON, but I've definitely used it to output simple JSON.

Are you telling me that awk can correctly identify delimiters inside quoted strings? Escaped quotes inside quoted strings? Newlines inside quoted strings? I.e. that awk actually has a csv parser? Very cool if so.

Yeah, you can implement a basic FSM and use `next` to handle fake `RS` hits (e.g. newlines inside quoted strings).

I'm not necessarily recommending it, but it's certainly possible and could be portable and really fast to run with a low memory footprint.
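
For the curious, a rough sketch of what such an FSM might look like. This is not necessarily what the comment above had in mind: `csv_field` is a hypothetical helper, and it assumes comma-separated input with RFC 4180-style quoting (a quote inside a quoted field is written as two quotes). It uses `next` to swallow record breaks that fall inside a quoted field.

```shell
# csv_field N: print field N of each logical CSV record read from stdin.
# Hypothetical helper; assumes RFC 4180-style quoting (quotes doubled to escape).
csv_field() {
  awk -v n="$1" '
  {
    # A quoted field may span lines: glue physical lines together
    # until the number of double quotes seen so far is even.
    if (pending) { buf = buf "\n" $0 } else { buf = $0 }
    tmp = buf
    if (gsub(/"/, "", tmp) % 2) { pending = 1; next }
    pending = 0

    # Two-state machine: inq is 1 while inside a quoted field.
    nf = 0; field = ""; inq = 0
    len = length(buf)
    for (i = 1; i <= len; i++) {
      c = substr(buf, i, 1)
      if (inq) {
        if (c != "\"") { field = field c }
        else if (substr(buf, i + 1, 1) == "\"") { field = field c; i++ }
        else { inq = 0 }
      } else if (c == "\"") { inq = 1 }
      else if (c == ",") { out[++nf] = field; field = "" }
      else { field = field c }
    }
    out[++nf] = field
    print (n <= nf ? out[n] : "")
  }'
}

printf '%s\n' '1,"Foo, Bar (1999)",Drama' '2,"He said ""hi""",Comedy' | csv_field 2
# prints:
#   Foo, Bar (1999)
#   He said "hi"
```

The quote-parity check is a shortcut: it works for well-formed input but a stray quote in an unquoted field would throw it off, which is roughly the kind of edge case a tested library already covers.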

Well, awk having a csv parser via the user implementing that parser is not quite what I have in mind when I turn to awk for some quick field splitting—and I don't think it's what others in the thread meant either, as evidenced by the linked site.

Personally I prefer using a readymade and tested library in any language that I might touch, so I can just do my own thing on top. Or, on the command line, to use a util that employs such a library. Kind of hope that I'm never so constrained that only awk is available and I can't even spin up Lua.