Hacker News new | ask | show | jobs
by redsaz 1362 days ago
A satisfied user here. Found it very useful when tools like cut and sort weren't enough, usually when I need to do a join on two different tables (err, files). Left joins work, but I don't think right joins are supported.

I've used this in combination with jq as well. I'll use jq to convert json to CSV, and then use SQL to do whatever else.

6 comments

Cut, sort, join and awk can be pretty powerful and fast. If it becomes too tedious to manually write them, you can also use BigBash [1] to convert a SQL query automatically to a one-liner that only use these tools to execute the query.

[1] http://bigbash.it

Any experienced programmer learns to not use string processing on structured data, because that will bite them in the ass.

Meanwhile HN luddites: let me use awk, cut and whatnot despite the existence of an util that explicitly sidesteps this issue.

/me runs the example on bigbash.it, cleaned up a bit:

    (
      trap "kill 0" SIGINT;
      export LC_ALL=C;
      find movies.dat.gz -print0
        | xargs -0 -i sh -c "gzip -dc {} | tail -n +2"
        | sed "s/::/;/g"
        | cut -d $';' -f2
        | sort -t$';'  -k 1,1
        | head -n10
        | awk -F ';' '{print $1}'
    )
Yeah, how about no. That's a very neat site and a clever hack, but there are clear escaping flaws in there for valid movie names.

bash and standard unix tools are a terrible structured-data manipulator. it's part of why `jq` is so widely used and loved, despite being kinda slow and hard to remember at times - it does things correctly, unlike most glued-together tools.

Yep, pretty sure that this script doesn't handle quoted strings in any way, and would promptly mangle a title that contains a semicolon.
"structured data" usually means there are delimiting characters, states, etc. AWK can certainly handle this. It's a simple and powerful language.

I don't think I've ever used it to parse JSON, but I've definitely used it to output simple JSON.

Are you telling me that awk can correctly identify delimiters inside quoted strings? Escaped quotes inside quoted strings? Newlines inside quoted strings? I.e. that awk actually has a csv parser? Very cool if so.
Yeah, you can implement a basic FSM and use `next` to handle fake `RS` (e.g. newlines).

I'm not necessarily recommending it, but it's certainly possible and could be portable and really fast to run with a low memory footprint.

Well, awk having a csv parser via the user implementing that parser is not quite what I have in mind when I turn to awk for some quick field splitting—and I don't think it's what others in the thread meant either, as evidenced by the linked site.

Personally I prefer using a readymade and tested library in any language that I might touch, so I can just do my own thing on top. Or, in command line, to use an util that employs such a library. Kind of hope that I'm never so constrained that only awk is available and I can't even spin up Lua.

Powerful and fast, and also *portable*. It'll run on your low-privileged tools box, 20 year old beige box, vhost, you name it.
In addition to the usual cut/paste/sort/awk stuff, we've had really powerful "stream-operator" databases based on flat text data files for decades. They used to be somewhat slow. Not anymore, esp when running from RAMdisks.

One good one is Strozzi NoSQL (his use of the term NoSQL predates the current use of the term by many years...): http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/NoSQL/Home%20P...

Starbase is another, with interesting extensions for astronomical work.

Linux Review article on the concept here: https://www.linuxjournal.com/article/3294

The article that started it all: http://www.linux.it/~carlos/nosql/4gl.ps

And there's even a book on the subject, centered on the /rdb implementation by the late RSW software. But I warn you, reading this WILL permanently change the way you think about databases: https://www.amazon.com/Relational-Database-Management-Prenti...

Depending on how complex the task is, I also jump from sort/join/awk/sed to the SQL train more often. But if I have already gone this step, then I would also like to have the whole SQL(ite) power and that would then but really blow up the command line. In such cases I usually write a TCL script, the integration of SQLite3 [0] is quasi native and besides the full SQLite3 functionality I also have flexible extensions (e.g. directly usable TCL procedures of any complexity) at my disposal. Tools like Q represent a middle ground, although they build on SQLite they remain behind in functionality [1]. But as long as I want to keep it simple on the command line while adhering to the UNIX philosophy, coreutils, sed, awk and possibly perl remain my best friends.

[0]: https://www.sqlite.org/tclsqlite.html [1]: http://harelba.github.io/q/#limitations

Same. You can go a long way with cut, sort, etc. and also awk with its pattern matching. But if you're handy with SQL, that can often feel more natural and certainly things like joins among separate CSV files, as well as sums and other aggregates, are easier.

If you have "unclean" CSV data, e.g. where the data contains delimiters and/or newlines in quoted fields, you might want to pipe it through csvquote.

https://github.com/dbro/csvquote

You also probably have GNU join installed: https://www.gnu.org/software/coreutils/manual/html_node/join...
Have you looked at the Unix command ‘join’? This is a cool tool, but I think join is pretty much everywhere.
Hmm, I'll have to check it out!