by nop_slide 1493 days ago
Do you have some personal examples where Python wasn't suitable over bash for a non-trivial task?

I am admittedly a not very good bash'er, and this week at work I whipped up my first "complicated" bash script, but ended up rewriting it in Python. It might have just been the case that the IO-bound nature of my task fell right into Python's strong suits (async IO), but it had me wondering in what scenarios a bash script would be better.

Essentially I had to:

* iterate over a list of hundreds of thousands of files

* make an api call via aws cli

* take result and process it through a few shell utilities (`date` and `touch`) to then update the timestamps on the files.
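For comparison, the last step (the `date` and `touch` part) can be done without leaving Python. This is only a minimal sketch: it assumes the API hands back an ISO 8601 timestamp, which the post does not say, and `set_mtime_from_iso` is a made-up helper name.

```python
import os
import tempfile
from datetime import datetime, timezone


def set_mtime_from_iso(path: str, iso_timestamp: str) -> float:
    """Set a file's access and modification times from an ISO 8601 string.

    Roughly the job that `date` (parse) plus `touch` (apply) do in the shell.
    """
    moment = datetime.fromisoformat(iso_timestamp)
    if moment.tzinfo is None:
        # Assumption: a timestamp without an offset is treated as UTC.
        moment = moment.replace(tzinfo=timezone.utc)
    epoch = moment.timestamp()
    os.utime(path, (epoch, epoch))
    return epoch


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as handle:
        set_mtime_from_iso(handle.name, "2021-03-04T05:06:07+00:00")
        print(os.path.getmtime(handle.name))
```

Doing this in-process saves two process spawns per file, which adds up over hundreds of thousands of files.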

I initially wrote it in bash which spawned ~16 background workers (threads/processes?) via `&` and blocking via `wait -n`. This "worked" but was pretty slow as the threads were thrashing checking for responses. Doing anything more than 16 caused my computer to crawl lol.

I then rewrote it in Python with async IO and the async subprocess API (to run shell commands) and it was an order of magnitude faster.
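A rough sketch of that shape, for anyone curious: `asyncio.create_subprocess_exec` runs each command and a semaphore caps how many are in flight. The real `aws` invocation isn't shown in this post, so `placeholder_command` is a hypothetical stand-in that just shells out to the Python interpreter, and the limit of 64 is an arbitrary assumption.

```python
import asyncio
import sys


async def run_command(*argv: str) -> str:
    """Run one external command without blocking the event loop; return its stdout."""
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(f"{argv[0]} failed: {stderr.decode().strip()}")
    return stdout.decode().strip()


async def process_all(paths, build_command, limit=64):
    """Run build_command(path) for every path, with at most `limit` running at once."""
    gate = asyncio.Semaphore(limit)

    async def one(path):
        async with gate:
            return path, await run_command(*build_command(path))

    # gather() returns results in the same order as the input paths.
    return await asyncio.gather(*(one(p) for p in paths))


def placeholder_command(path):
    # Hypothetical stand-in for the real `aws ...` call, which the post does not show.
    return [sys.executable, "-c", "import sys; print(sys.argv[1].upper())", path]


if __name__ == "__main__":
    results = asyncio.run(process_all(["a.txt", "b.txt"], placeholder_command, limit=2))
    print(results)
```

The event loop sleeps until a child process has output ready, so raising the limit mostly costs memory and process slots rather than CPU spent polling.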

I wish I was better at bash, but maybe I just haven't spent enough time with it. Doing this task made me feel like I could pretty much do most things in Python if I need a non-trivial script.

3 comments

I think it's less about pure suitability, and more about availability.

If I write a Bash script, I can be reasonably certain it will run on any unixoid system that came out in the last years. With Python, I am now in version hell (it's ridiculous how many Python2-first-servers I still find), and I run the risk of imported modules suddenly becoming incompatible or buggy.

Can Bash do everything that Python can? Almost certainly not. But it is available, it is relatively simple (and almost minimalistic), and it forces you to learn more about standard unix tools.

Your specific use case ... well, I think that's pretty unusual - I'd wager if you had avoided the parallel processing and done it in a more linear way, you'd have had a better time.

That makes sense! I will admit it was a bit of fun cobbling together the bash version at first. It reminded me of starting out programming where everything felt a bit esoteric lol.

> if you had avoided the parallel processing and done it in a more linear way, you'd have had a better time.

I had to process ~500,000 files, and the aws api call took ~1 second on average, so it would have taken significantly longer to process linearly. For example the bash version I whipped up processed ~30k files in 2.5 hours, while the python version did 30k files in ~20 minutes.
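To put those numbers side by side, here is the back-of-the-envelope arithmetic, using only the figures quoted above and assuming each observed rate would hold for the whole job:

```python
TOTAL_FILES = 500_000
SECONDS_PER_CALL = 1.0  # average API latency quoted above


def hours_at_observed_rate(files_done: int, minutes_taken: float) -> float:
    """Extrapolate an observed batch to the full 500k-file job, in hours."""
    files_per_minute = files_done / minutes_taken
    return TOTAL_FILES / files_per_minute / 60


serial_hours = TOTAL_FILES * SECONDS_PER_CALL / 3600
bash_hours = hours_at_observed_rate(30_000, 150)   # 30k files in 2.5 hours
python_hours = hours_at_observed_rate(30_000, 20)  # 30k files in ~20 minutes

print(f"one call at a time: {serial_hours:.1f} h")                 # 138.9 h
print(f"bash, ~16 workers:  {bash_hours:.1f} h")                   # 41.7 h
print(f"python, async IO:   {python_hours:.1f} h")                 # 5.6 h
print(f"python vs bash:     {bash_hours / python_hours:.1f}x")     # 7.5x
```

The bash figure works out to about 3.3 files per second, well under the ~16 per second that 16 workers at ~1 second per call would give in the ideal case.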

But yes I agree, if I didn't have to do such a large volume at once the bash version would have been just fine.

> If I write a Bash script, I can be reasonably certain it will run on any unixoid system that came out in the last years.

Except for, like, all the (free) BSDs and Illumos. What “unixoid” systems were you referring to, exactly, other than Linux?

Not to mention that bash isn't reliably portable even to the exact same machine, because it has so many footguns. Creating a new file can break a previously working bash script, because the script may have used `*.txt` expecting a single file, or whatever, and there are a million others.
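The glob case is easy to reproduce. The sketch below uses Python's `glob` module as a stand-in for shell pathname expansion (an approximation, not bash itself), with made-up file names, to show the same pattern matching one file and then two:

```python
import glob
import os
import tempfile


def matches(pattern: str) -> list:
    """Return the sorted base names a glob pattern currently expands to."""
    return sorted(os.path.basename(p) for p in glob.glob(pattern))


with tempfile.TemporaryDirectory() as workdir:
    pattern = os.path.join(workdir, "*.txt")

    # Day one: a single match, so "the file" and "*.txt" look interchangeable.
    open(os.path.join(workdir, "report.txt"), "w").close()
    before = matches(pattern)

    # Later someone drops another .txt file into the same directory.
    open(os.path.join(workdir, "notes.txt"), "w").close()
    after = matches(pattern)

print(before)  # ['report.txt']
print(after)   # ['notes.txt', 'report.txt']
```

In a shell script, a command written on the assumption of one match would now receive two arguments, without the script itself having changed.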

Don’t get me wrong, I accept that there are legitimate use cases for bash, but I really can’t help but feel that anything longer than 3 lines (including #!) is better off in anything else.

This description, to me, sounds like you wrote a Python program... in shell. Nothing wrong with Python programs, but writing them in shell is about as pleasant as writing them in C.

The shell’s parallel processing capabilities shine when you want consecutive stream transformation steps to run in parallel (note that you can pipe to and from loops, over multiple descriptors if necessary). When your task can be structured like that, it will often take as much time to write a parallel implementation in shell as it will a serial one in Python (etc.). On the other hand, running multiple iterations of the same step in parallel is, as you’ve seen, awkward.

If the only thing you actually need to parallelize is HTTP requests, you might try piping things into aria2c[1] or the like to generate temporary files with responses and then processing those serially (best if your tool can tell you on stdout when a job is done; otherwise inotifywait and its ilk[2] may help if you’re uncomfortable just hammering the filesystem). When that is not enough, you could try GNU Parallel[3] or, if all else fails, implementing a jobserver in the style of GNU Make[4], which you may find more pleasant to do in shell than your current design. Generating a Makefile and then letting Make manage the worker pool (and call out to a separate shell script for each job) is also a possibility.
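To make the jobserver idea concrete, here is a loose sketch of the token-pipe mechanism: a pipe is preloaded with one byte per slot, a job may only start after a byte has been read, and the byte is written back when the job finishes. It is written in Python with threads purely for illustration; `TokenPipe` and `run_jobs` are hypothetical names, and GNU Make's real protocol has details (such as the implicit slot each process already holds) that this simplification leaves out.

```python
import os
import threading
import time


class TokenPipe:
    """A pipe preloaded with one-byte tokens, one per allowed concurrent job."""

    def __init__(self, slots: int):
        self.read_fd, self.write_fd = os.pipe()
        os.write(self.write_fd, b"+" * slots)

    def acquire(self) -> bytes:
        # Blocks until a token is available, i.e. until some job has finished.
        return os.read(self.read_fd, 1)

    def release(self, token: bytes) -> None:
        os.write(self.write_fd, token)

    def close(self) -> None:
        os.close(self.read_fd)
        os.close(self.write_fd)


def run_jobs(jobs, slots: int) -> int:
    """Run every callable in `jobs`, at most `slots` at once; return the peak concurrency seen."""
    pipe = TokenPipe(slots)
    lock = threading.Lock()
    state = {"running": 0, "peak": 0}

    def worker(job, token):
        try:
            with lock:
                state["running"] += 1
                state["peak"] = max(state["peak"], state["running"])
            job()
        finally:
            with lock:
                state["running"] -= 1
            pipe.release(token)  # hand the slot back for the next job

    threads = []
    for job in jobs:
        token = pipe.acquire()  # wait here until a slot is free
        thread = threading.Thread(target=worker, args=(job, token))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    pipe.close()
    return state["peak"]


if __name__ == "__main__":
    print(run_jobs([lambda: time.sleep(0.05)] * 8, slots=3))
```

Because the limit lives in the pipe rather than in any one process, the same file descriptors can be inherited by child processes that want to share the pool, which is the property that makes the design attractive for shell scripts.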

But at that point (in particular if you’re not I/O-bound) it might be best to just not do it in shell[5].

[1] https://aria2.github.io/manual/en/html/aria2c.html

[2] https://jvns.ca/blog/2020/06/28/entr/

[3] https://www.gnu.org/software/parallel/

[4] https://www.gnu.org/software/make/manual/html_node/POSIX-Job...

[5] https://sanctum.geek.nz/etc/emperor-sh-and-the-traveller.txt

Python isn't suitable (by default) when the main task is to run multiple other programs, in pipes and elsewhere. Env vars are a bit harder to get to, etc. Common sys-admin use cases basically.

Yes, you can write a run() function that handles everything, and there are some 90% solutions in subprocess now, but the defaults are not as simple as a modern shell on this dimension.
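As an illustration of what such a helper tends to look like, here is a minimal sketch of a pipeline runner built on `subprocess.Popen`. The name `run_pipeline` is hypothetical, and the demo commands call the Python interpreter only so the example does not depend on any particular Unix tool being installed.

```python
import subprocess
import sys


def run_pipeline(*commands):
    """Run commands as cmd1 | cmd2 | ... and return the last command's stdout as text."""
    procs = []
    upstream = subprocess.DEVNULL  # the first command gets an empty stdin
    for argv in commands:
        proc = subprocess.Popen(argv, stdin=upstream, stdout=subprocess.PIPE)
        if procs:
            # Close our copy of the pipe so only the downstream process holds the read end.
            procs[-1].stdout.close()
        procs.append(proc)
        upstream = proc.stdout

    output, _ = procs[-1].communicate()
    # Check every stage, similar in spirit to `set -o pipefail` in bash.
    for proc in procs:
        code = proc.wait()
        if code != 0:
            raise subprocess.CalledProcessError(code, proc.args)
    return output.decode()


if __name__ == "__main__":
    producer = [sys.executable, "-c", "print('b'); print('a'); print('b')"]
    dedupe = [sys.executable, "-c",
              "import sys; sys.stdout.writelines(sorted(set(sys.stdin)))"]
    print(run_pipeline(producer, dedupe).split())  # ['a', 'b']
```

The shell equivalent is a single line with a `|` in it, which is the gap in defaults being described here.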

Of course, if you need to do any string manipulation, advanced conditionals, or math, Python quickly pulls ahead.