| HN Mirror



	by OneOneOneOne 3615 days ago
	I wonder if Python's rise in popularity has reduced shell scripting? Python has even displaced AWK in my tool use.

2 comments

javier2 3614 days ago

Go has replaced all my 'scripting' needs. It's way easier to deploy static binaries, and I get static typing for my scripts. Maintaining a Python install on multiple servers and OSes is a nightmare.


Sukotto 3615 days ago

Can you give some brief examples of how you've done that?

I wouldn't mind if I never had to use awk again...


gamegoblin 3615 days ago

I admit I have never truly learned awk outside of the most dead simple stuff, but one of the most useful python utilities I have ever written is below. It allows you to use python lambdas on lines of stdin.

Example usage is:

    lambda "x0 + x1 * x2" int int int

Code:

    import sys
    # libs I use commonly, so they are available inside the lambda
    import random, itertools, re

    # parse the types of the columns (e.g. int, str, float)
    types = [eval(t) for t in sys.argv[2:]]

    # craft a lambda with the args x0, x1, ... xN
    params = ','.join("x" + str(i) for i in range(len(types)))
    f = eval("lambda " + params + ":" + sys.argv[1])

    # apply lambda to stdin, don't print results of None
    for line in sys.stdin:
        args = []
        for t, e in zip(types, line.strip().split()):
            args.append(t(e))
        result = f(*args)
        if result is not None:
            print(result)

Examples:

Where data.txt is

    john doe 37
    jane doe 35
    jack bob 20
    bill bob 40

And I do

    cat data.txt | lambda "x0 if x2 < 36 else x1" str str int

It will output

    doe
    jane
    jack
    bob
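
To make the mechanics concrete, here is a minimal Python 3 sketch of what the tool does with the example above. The helper names `build_lambda` and `apply_to_lines` are made up for illustration and are not part of the original script, and the `data.txt` sample is inlined as a list instead of being read from stdin.

```python
def build_lambda(expr, ncols):
    """Build a function of x0..x{ncols-1} from an expression string.

    This uses eval, so only pass expressions you trust.
    """
    params = ",".join(f"x{i}" for i in range(ncols))
    return eval(f"lambda {params}: {expr}")


def apply_to_lines(expr, types, lines):
    """Convert each whitespace-separated column with its type, apply expr,
    and collect every result that is not None."""
    f = build_lambda(expr, len(types))
    results = []
    for line in lines:
        args = [t(field) for t, field in zip(types, line.split())]
        result = f(*args)
        if result is not None:
            results.append(result)
    return results


# Hypothetical stand-in for the contents of data.txt.
data = ["john doe 37", "jane doe 35", "jack bob 20", "bill bob 40"]
print(apply_to_lines("x0 if x2 < 36 else x1", [str, str, int], data))
# prints ['doe', 'jane', 'jack', 'bob']
```

For three columns the generated source is `lambda x0,x1,x2: x0 if x2 < 36 else x1`, which is why the column names in the expression line up with the positional types.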

You can use this sort of tool for a million things, e.g. to randomly sample roughly 1 out of every 1000 lines:

    lambda "x0 if random.random() <= .001 else None" str

It's probably about as powerful as awk, but I'm much more familiar with Python, so it's useful for me.
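
The sampling one-liner keeps each line independently with a fixed probability, so the number of lines kept is only approximately 1 in 1000. A minimal sketch of the same idea as a standalone function follows; the name `sample_lines` and the `seed` parameter are assumptions added for illustration, and it compares with `<` where the one-liner uses `<=`, which makes no practical difference.

```python
import random


def sample_lines(lines, rate, seed=None):
    """Keep each line independently with probability `rate`."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return [line for line in lines if rng.random() < rate]


lines = [f"line {i}" for i in range(100_000)]
kept = sample_lines(lines, 0.001, seed=42)
print(len(kept))  # close to 100 on average; the exact count depends on the seed
```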


claystu 3615 days ago

To accomplish the same thing in awk:

    $ awk '{if ($3 < 36) print $1; else print $2}' data.txt
    doe
    jane
    jack
    bob

Python has a lot of strengths and is better than awk at a lot of things, but one-liner, column-based text processing on the command line is literally awk's bread and butter.


visarga 3614 days ago

I made a similar tool in Perl. It evals a Perl command, passed as a string, on each line of stdin. In the middle I can go wild with regexes and hashes/dictionaries and whatnot. It's one of my most used tools.

Instead of writing scripts for each little task, I just write one-liners. When one grows to more than 2 lines it becomes unwieldy, and I switch it to a regular script.


iheartmemcache 3614 days ago

I grew up data munging with awk/sed/tail/head manipulations, avoiding Perl 5 (not out of any antipathy; I just worked off of what my father's bookshelf had, and there was no Camel book on it). Back in the 90s we'd publicly post our dotfiles to our Apache 1.3 servers amongst peers, but (perhaps this was just a component of the IRC community I was a part of) we didn't share much of the script sets we built up over time. The furthest we'd go is "I want to do foo": someone with more knowledge than you would give you a series of invocation parameters, and over time you'd osmotically acquire enough knowledge to be the one dispensing it.

From what I hear, Perl 5 is prime for the problem set you defined, but I've never seen any aggregated resource of people's Perl 5 munging scripts. Do us all a favor and post a GitHub Gist of that tool (along with common invocations of you going wild with regexes and hashmaps). If you're feeling overly generous, post the source of the commonly used regular scripts as well.


claystu 3614 days ago

You should check out Minimal Perl [https://amzn.com/1932394508] if you haven't already done so.


gamegoblin 3615 days ago

Definitely.

I don't think of it as a better awk, just a drop-in to minimize my cognitive overhead. It's basically Python-flavored awk.


doug1001 3614 days ago

nice one. small, sharp tools in the unix toolbox. given that this is the type of processing one might routinely run over a file with tens of millions of lines, what's your guess on the difference in performance between the python snippet above and your awk one-liner? (my guess is about 100x in favor of awk)


claystu 3614 days ago

I honestly have no idea. Given that there are different versions of awk and different versions of python, I'm not even sure there is an answer.

Given awk's age (1977, when computers were much slower and memory much more expensive) and pedigree (Aho, Weinberger, and Kernighan), I wouldn't bet against it for a task like the one you describe, but that's just a feeling. Again, I don't have any numbers to support that.


iheartmemcache 3614 days ago

My gut-ballpark was going to be around an order of magnitude, not two. Here are two naive comparisons (granted, from the late aughts, and not a direct comparison but a more general one) that show ~7-8x[0,1]. The overhead of PyStringObject is not trivial[2] (though the implementation details have likely changed between Py2.x and Python 3).

For things like building accumulators over a set of data (log parsing such as hits per hour, or other enumerative tasks) rather than data munging, I'd imagine (g|n)?awk might hit your 100x, since you'd just grab the fd and traverse, being IO bound. I'm not sure how awk does it, but if it's just saving an accumulator value (or 10) in a register rather than having to keep a full object in L1 cache, awk has a huge advantage. (That assumes x86-64 treats, say, an r3 fetch analogously to a fetch to ecx; err, rcx now I guess.)
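
For reference, the accumulator-style task mentioned above (hits per hour) might look like this in Python. This is only a sketch: the log format, with an ISO 8601 timestamp as the first field, and the name `hits_per_hour` are assumptions for illustration, and it says nothing about how fast either language is.

```python
from collections import Counter


def hits_per_hour(lines):
    """Count lines per hour, assuming each line starts with a timestamp
    like 2015-06-01T13:45:10 (a hypothetical log format)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        hour = fields[0][:13]  # keep "YYYY-MM-DDTHH"
        counts[hour] += 1
    return counts


# Made-up sample log lines.
log = [
    "2015-06-01T13:45:10 GET /index.html",
    "2015-06-01T13:59:59 GET /about.html",
    "2015-06-01T14:00:01 GET /index.html",
]
for hour, n in sorted(hits_per_hour(log).items()):
    print(hour, n)
# prints:
# 2015-06-01T13 2
# 2015-06-01T14 1
```

Note that every increment here goes through a dictionary lookup on a string key, which is the kind of per-line object overhead being discussed.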

---

N.b., if you're benching tasks like this, don't use 'time' with the output going to STDOUT and think you're getting real performance numbers. Terminals can only render $x lines a minute, so the kernel call to write(STDOUT, ....) will be where you choke, not the language. Disk fragmentation would be another issue, so put both the test file and the output file on a RAM disk. Flush the cache with something like `sync; echo 3 > /proc/sys/vm/drop_caches` (on Linux; I forget the BSD way of doing it), then run `time benchmark.py /mnt/ramdisk1/file > /mnt/ramdisk2` over multiple runs, under various loads, with different data sets, etc.
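
A rough sketch of the "output to a file, multiple runs" part of that advice, using only the Python standard library. The function name `time_command` and the placeholder command are assumptions; the sketch does not drop caches or set up a RAM disk, which would have to happen outside the script (for example by pointing `tempfile` at a RAM disk mount via its `dir` argument).

```python
import statistics
import subprocess
import sys
import tempfile
import time


def time_command(cmd, runs=5):
    """Run cmd `runs` times with stdout sent to a file, not the terminal,
    and return the wall-clock duration of each run in seconds."""
    durations = []
    for _ in range(runs):
        with tempfile.TemporaryFile() as out:
            start = time.perf_counter()
            subprocess.run(cmd, stdout=out, check=True)
            durations.append(time.perf_counter() - start)
    return durations


# Placeholder command: swap in the real script and input file.
cmd = [sys.executable, "-c", "print('hello')"]
durations = time_command(cmd, runs=3)
print(f"min {min(durations):.4f}s  median {statistics.median(durations):.4f}s")
```

Reporting the minimum and median over several runs, instead of a single number, makes it easier to spot runs that were disturbed by other load on the machine.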

Another interesting thing to note is that comp arch is so advanced (I was reading a paper on formal verification of ISAs, and apparently even 12 dollar ARMs now have out-of-order instruction execution type stuff) that between the kernel scheduler and CPU optimizations, Python will likely benefit much more from disk-seek latency (effectively allowing PyStringObject allocation to occur while you're waiting for /dev/sd$n to return).

I'm certainly not an authority on rigorous benchmarks though - someone like Brendan Gregg please jump in!

[0] https://diamondinheritance.blogspot.com/2008/04/awk-vs-pytho...

[1] https://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-...

[2] http://www.laurentluce.com/posts/python-string-objects-imple...


doug1001 3612 days ago

nice one--i learned a few things, in fact (& caused me to realize once again, just how sloppy we are with benchmarking on my team)
