Go has replaced all my 'scripting' needs. Way easier to deploy static binaries and I get static typing for my scripts. Maintaining a Python install on multiple servers and OSes is a nightmare.
I admit I have never truly learned awk outside of the most dead simple stuff, but one of the most useful Python utilities I have ever written is below. It allows you to use Python lambdas on lines of stdin.
Example usage is:
lambda "x0 + x1 * x2" int int int
Code:
import sys
# libs I use commonly...
import random, itertools, re
# parse the types of the columns (a list, not a lazy map, so len() works)
types = [eval(t) for t in sys.argv[2:]]
# craft a lambda with the args x0, x1, ... xN
f = eval("lambda " + ','.join("x" + str(i) for i in range(len(types))) + ": " + sys.argv[1])
# apply lambda to stdin, don't print results of None
for line in sys.stdin:
    args = []
    for t, e in zip(types, line.strip().split()):
        args.append(t(e))
    result = f(*args)
    if result is not None:
        print(result)
Examples:
Where data.txt is
john doe 37
jane doe 35
jack bob 20
bill bob 40
And I do
cat data.txt | lambda "x0 if x2 < 36 else x1" str str int
It will output
doe
jane
jack
bob
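For anyone who wants to poke at this without piping through stdin, here is a minimal sketch of the same idea wrapped in functions. The names make_row_func and apply_to_lines are made up for illustration and are not part of the script above. It assumes Python 3 and passes an explicit namespace to eval so the expression can still see random, itertools and re.

```python
import itertools
import random
import re

# Names the expression is allowed to see; builtins (int, str, ...) are added by eval.
NAMESPACE = {"random": random, "itertools": itertools, "re": re}


def make_row_func(expr, type_names):
    """Build (types, f) where f(x0, ..., xN) evaluates expr. Hypothetical helper."""
    # eval on arbitrary strings is only reasonable for a personal CLI tool.
    types = [eval(name, NAMESPACE) for name in type_names]
    params = ",".join("x%d" % i for i in range(len(types)))
    func = eval("lambda %s: %s" % (params, expr), NAMESPACE)
    return types, func


def apply_to_lines(lines, expr, type_names):
    """Yield the non-None results of applying expr to each whitespace-split line."""
    types, func = make_row_func(expr, type_names)
    for line in lines:
        args = [t(field) for t, field in zip(types, line.split())]
        result = func(*args)
        if result is not None:
            yield result


data = ["john doe 37", "jane doe 35", "jack bob 20", "bill bob 40"]
results = list(apply_to_lines(data, "x0 if x2 < 36 else x1", ["str", "str", "int"]))
print(results)  # ['doe', 'jane', 'jack', 'bob']
```

Keeping the eval in one helper also makes it easy to swap in something stricter later if the tool ever leaves your own machine.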
You can use this sort of tool for a million things. e.g. randomly sample roughly 1 out of every 1000 lines:
lambda "x0 if random.random() <= .001 else None" str
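One thing worth spelling out: the random.random() test keeps each line independently with probability about 0.001, so the number of lines you get back varies from run to run. A rough sketch contrasting that with a fixed stride, where sample_random and sample_every are hypothetical names used only here:

```python
import itertools
import random


def sample_random(lines, rate=0.001, rng=None):
    """Keep each line independently with probability `rate`."""
    if rng is None:
        rng = random.Random()
    return [line for line in lines if rng.random() <= rate]


def sample_every(lines, step=1000):
    """Keep exactly every `step`-th line, starting with the first."""
    return list(itertools.islice(lines, 0, None, step))


lines = ["line %d" % i for i in range(100000)]
random_sample = sample_random(lines, rng=random.Random(42))  # seed is arbitrary
strided_sample = sample_every(lines)
# The random count hovers around 100; the strided count is exactly 100.
print(len(random_sample), len(strided_sample))
```

The random version is usually what you want for log sampling, since a fixed stride can line up with periodic patterns in the data.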
It's probably the same power and whatnot as awk, but I'm much more familiar with Python, so it's useful for me.
$ awk '{if ($3 < 36) print $1; else print $2}' data.txt
doe
jane
jack
bob
Python has a lot of strengths and is better than awk at a lot of things, but one-liner column-based text processing on the command line is literally awk's bread and butter.
I made a similar tool in Perl. It evals a Perl command, passed as a string, on each line of stdin. In the middle I can go wild with regexes and hashes/dictionaries and whatnot. It's one of my most used tools.
Instead of writing scripts for each little task, I just write one-liners. When one grows past 2 lines it becomes unwieldy and I switch it to a regular script.
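Not the Perl tool itself (its source isn't posted here), but a rough Python sketch of the same shape: exec a statement string once per line, with state that survives between lines so a dict can accumulate. run_per_line, counts and emit are all invented names for illustration.

```python
import re
from collections import defaultdict


def run_per_line(code, lines, end_code=None):
    """Exec `code` once per line with `line` bound; state persists across lines."""
    out = []
    namespace = {
        "re": re,
        "counts": defaultdict(int),  # hypothetical scratch dict for tallies
        "emit": out.append,          # hypothetical stand-in for print
    }
    compiled = compile(code, "<one-liner>", "exec")
    for line in lines:
        namespace["line"] = line.rstrip("\n")
        exec(compiled, namespace)
    if end_code:
        # Optional final step, similar in spirit to an END block.
        exec(end_code, namespace)
    return out


lines = ["john doe 37\n", "jane doe 35\n", "jack bob 20\n", "bill bob 40\n"]
result = run_per_line(
    r"m = re.match(r'\w+ (\w+)', line); counts[m.group(1)] += 1",
    lines,
    end_code="for k in sorted(counts): emit('%s %d' % (k, counts[k]))",
)
print(result)  # ['bob 2', 'doe 2']
```

Like the lambda tool above, this runs arbitrary code from the command line, so it only makes sense for your own one-liners.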
I grew up data munging with awk/sed/tail/head manipulations, avoiding Perl 5 (not out of any antipathy, but I just worked off of what my father's bookshelf had and there was no Camel book on there). Back in the 90s we'd publicly post our dotfiles to our Apache 1.3 servers amongst peers, but (perhaps this was just a component of the IRC community I was a part of) we didn't share many of the script sets we built up over time. The furthest we'd go is "I want to do foo": someone with more knowledge than you would give you a series of invocation parameters, and over time you'd osmotically acquire enough knowledge to be the one dispensing it.
From what I hear, Perl 5 is prime for the problem set you defined, but I've never seen any aggregated resource of people's Perl 5 munging scripts. Do us all a favor and post a GitHub Gist of that tool (along with common invocations of you going wild with regexes and hashmaps). If you're feeling overly generous, post the source of the commonly used regular scripts as well.
nice one. small, sharp tools in the unix toolbox. given that this is the type of processing one might routinely run over a file with tens of millions of lines, what's your guess on the difference in performance between the python snippet above and your awk one-liner? (my guess is about 100x in favor of awk)
I honestly have no idea. Given that there are different versions of awk and different versions of python, I'm not even sure there is an answer.
Given awk's age (1977, when computers were much slower and memory much more expensive) and pedigree (Aho, Weinberger, and Kernighan), I wouldn't bet against it for a task like you describe, but that's just a feeling. Again, I don't have any numbers to support that.
My gut-ballpark was going to be around an order of magnitude, not two. Here are two naive comparisons (granted, from the late aughts and not a direct comparison but a more general one) that show ~7-8x[0,1]. The overhead of PyStringObject is not trivial[2] (though the implementation details have likely changed between Py2.x and Python 3).
For things like building accumulators over a set of data (log parsing rather than data munging: hits per hour or other enumerative tasks), I'd imagine (g|n)?awk might hit your 100x, since you'd just grab the fd and traverse, being IO bound. I'm not sure how awk does it, but if it's just saving an accumulator value (or 10) in a register rather than having to keep a full object in L1 cache, awk has a huge advantage. That's assuming x86-64 treats, say, an r3 fetch analogously to a fetch to ecx (err..rcx now I guess).
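To make the "hits per hour" kind of task concrete, here's a minimal Python sketch of the accumulator pattern. The log format is an assumption made up for the example (an ISO-style timestamp as the first whitespace-separated field), and hits_per_hour is a hypothetical name. It says nothing about relative speed; it just shows the work being discussed.

```python
from collections import Counter


def hits_per_hour(lines):
    """Count lines per hour, keyed by the 'YYYY-MM-DDTHH' prefix of field one."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        timestamp = line.split(None, 1)[0]
        counts[timestamp[:13]] += 1  # assumed format: 2024-01-01T13:05:22
    return counts


log = [
    "2024-01-01T13:05:22 GET /index",
    "2024-01-01T13:59:01 GET /about",
    "2024-01-01T14:00:00 GET /index",
    "",
]
counts = hits_per_hour(log)
for hour in sorted(counts):
    print(hour, counts[hour])
```

Each increment here goes through a dict lookup and integer object handling, which is the kind of per-line overhead the comment above is speculating about.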
---
N.b., if you're benching tasks like this, don't use 'time' with output going to STDOUT on a terminal and think you're getting real performance numbers. Your bottleneck will be the terminal: terminals can only render $x lines a minute, so the kernel call to write(STDOUT, ....) will be where you choke, not the language. Disk fragmentation would be another issue, so put both the test file and the output file on a RAMdisk. Flush the cache with something like sync; echo 3 > /proc/sys/vm/drop_caches (on Linux; I forget the BSD way of doing it), then `time benchmark.py /mnt/ramdisk1/file > /mnt/ramdisk2' over multiple runs, under various loads, with different data sets, etc.
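A rough sketch of the "redirect to a file, repeat the run" part of that advice, in Python. The bench function and the temp-file paths are made up for the example; it does not drop caches (that needs root) and it uses an ordinary temp directory rather than a RAMdisk, so treat the numbers as illustrative only.

```python
import os
import statistics
import subprocess
import sys
import tempfile
import time


def bench(cmd, input_path, output_path, runs=5):
    """Time `cmd` reading input_path on stdin and writing to a file, not a terminal."""
    timings = []
    for _ in range(runs):
        with open(input_path, "rb") as src, open(output_path, "wb") as dst:
            start = time.perf_counter()
            subprocess.run(cmd, stdin=src, stdout=dst, check=True)
            timings.append(time.perf_counter() - start)
    return timings


# Hypothetical test data: 10000 three-column lines in a temp directory.
workdir = tempfile.mkdtemp()
input_path = os.path.join(workdir, "data.txt")
output_path = os.path.join(workdir, "out.txt")
with open(input_path, "w") as fh:
    for i in range(10000):
        fh.write("john doe %d\n" % i)

# Stand-in workload: print the third column of every line.
child = "import sys\nfor l in sys.stdin:\n    sys.stdout.write(l.split()[2] + '\\n')"
cmd = [sys.executable, "-c", child]
timings = bench(cmd, input_path, output_path, runs=3)
print("median %.4fs, min %.4fs" % (statistics.median(timings), min(timings)))
```

To compare against awk you would pass something like ["awk", "{print $3}"] as cmd, assuming awk is installed. Process startup time is included in every run, which matters a lot for small inputs.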
Another interesting thing to note is that comp arch is so advanced (I was reading a paper on formal verification of ISAs, and apparently even 12 dollar ARMs now have out-of-order instruction execution type stuff) that between the kernel scheduler and CPU optimizations, Python will likely benefit much more from disk-seek latency (effectively allowing PyStringObject allocation to occur while you're waiting for /dev/sd$n to return).
I'm certainly not an authority on rigorous benchmarks though - someone like Brendan Gregg please jump in!