by socratic 5274 days ago
This presentation brings up a tangential point that has always confused me: how error-prone is starting a subprocess, really?

I agree with the author's goals of making common tasks easier and more obvious. urllib2 is an easy target, as it was added to the standard library over a decade ago, long before REST was something people talked about. The best tools for packaging, versioning, and testing have always been a bit ambiguous in any language, including Python.

However, the author points out something that has always bothered me about Python: it is way harder to start a subprocess with an external command in Python than almost any other language. This has been true whether using sys or os or even subprocess, which is quite recent.

I always felt that this had something to do with the constant warnings in the documentation about how a pipe between the subprocess and the Python process might fill and cause the subprocess to block. Or how running the program through shell rather than exec or something might cause some sort of security issue. Are these real issues that other languages ignore in the name of user convenience, or has Python just never been able to make the right API (as the author seems to argue)?
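To make the shell-versus-exec worry concrete, here is a minimal sketch, assuming a POSIX shell. The `filename` value is made up for illustration; it stands in for any string that came from outside the program.

```python
import subprocess
import sys

# Hypothetical untrusted input: a "filename" with a second command smuggled in.
filename = "notes.txt; echo INJECTED"

# Through the shell: /bin/sh parses the whole string, so the part after ';'
# runs as a separate command.
via_shell = subprocess.run(
    "echo " + filename, shell=True, capture_output=True, text=True
).stdout

# Exec-style: the argument list goes to the program untouched; no shell parses it.
via_exec = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", filename],
    capture_output=True, text=True,
).stdout
```

With the list form the semicolon is just another character in one argument, which is why the docs keep steering people away from `shell=True` for anything built from user input.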

4 comments

Creating a subprocess can be complex, at least if you expose all the different subtleties. If you've ever used Java's APIs to run processes, you know that Python's aren't the worst ;-)

There are lots of interesting corner cases, for example how to join stdout and stderr properly without blocking on one stream while the other is overflowing.
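One common way to sidestep that particular corner case, sketched here under the assumption that you don't need to tell the two streams apart afterwards, is to point stderr at the same pipe as stdout so there is only one pipe to drain. The child command below is a stand-in that writes to both streams.

```python
import subprocess
import sys

# Stand-in child that writes one line to each stream.
child = [
    sys.executable, "-c",
    "import sys; print('to stdout'); sys.stdout.flush(); "
    "print('to stderr', file=sys.stderr)",
]

# stderr=STDOUT makes both streams share a single pipe, so the parent
# can never block on one stream while the other fills up.
proc = subprocess.Popen(
    child, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
)
merged, _ = proc.communicate()  # reads until EOF, then waits for exit
```

If the streams must stay separate, `communicate()` with two pipes also avoids the deadlock, at the cost of holding all the output in memory.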

On the other hand, almost nobody ever needs this. Ruby's "output = `command`" probably covers 90% of the use cases with the most trivial API imaginable. The hard part obviously is exposing the advanced functionality without compromising on the simplicity.
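For comparison, a rough Python counterpart to the Ruby backtick one-liner could look like the sketch below. `backtick` is a hypothetical helper name, not something in the standard library, and it slurps the whole output into one string just like the Ruby version does.

```python
import subprocess

def backtick(command):
    """Run command through the shell and return its stdout as a string.

    Hypothetical helper; stderr is left alone and goes wherever the
    parent's stderr goes.
    """
    result = subprocess.run(command, shell=True, stdout=subprocess.PIPE, text=True)
    return result.stdout

output = backtick("echo hello")
```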

Almost all programming communities can learn a lot from Ruby's "if it's too hard, you're not cheating enough" approach (dhh quote I believe). Yes, the process could return an exabyte of stdout data, but do you really care? Is that really the problem this API should try to solve, with all special cases? That's not good computer science practice, but surprisingly effective.

The sad or happy truth is that thanks to advances in computing power, what used to be dummy toy programming is now not only a valid way of doing things but the correct one.

Using made-up stats, slurping the entire output of a process into one big string would have failed 99% of the time 30 years ago, 50% of the time 20 years ago, 1% of the time 10 years ago, but less than 0.01% of the time now. You'd waste your time doing it 'the right way'.

So it is with most simple data processing. If your goal is just to ship a product fast, you no longer need the old type of smart programmers; nowadays, smart means doing it fast and badly.

It's funny you mention that. The author/speaker wrote a "Subprocesses for humans" module, too: https://github.com/kennethreitz/envoy

There's no fundamental problem that's stopped Python from doing this before. For some reason, all of the ways to spawn a subprocess in Python have tried to map almost directly to the underlying C API... which is pretty awful.

    > For some reason, all of the ways to spawn a subprocess
    > in Python have tried to map almost directly to the
    > underlying C API
I think both are good and necessary. One of the strengths of Python is that if you have a copy of Stevens you can usually work out how to do something in Python. And this is awesome. I've written things on top of unix that in times past would have been written in C. However, that mechanism is usually not very "pythonic".

In the early days Python had a principle that there should be one way to do things. You don't hear this so much any more: we're long past that now, with some things different between 2.6 and 2.7 (arg handling), and with multiple broken libraries in the stdlib. When you're working on your own computer and your own time with root access you can always hand-roll outcomes. But it's common to have to deal with a spread of Pythons and cater to the most obsolete version. Yet I suspect some people still aspire to the one-way-to-do-it, and pretend it's true.

I think we should dump the principle.

A good example of why compromise is not the right outcome is the curses library - it's not quite Stevens, but it's not friendly either. It's hard to do good work with the curses library. We'd be better off if there was (1) a close curses mapping to the C ncurses mechanisms and (2) a nice-to-use abstraction layer that hid far more away from you.

To get around the problem of child subprocesses spewing out too much output and blocking, one can pass an open file handle as the stdout/stderr arguments of the Popen call. I've run into this many times and this solution has reliably worked for me every time. This could be documented better in the Python docs.
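A minimal sketch of that file-handle approach, assuming a temporary file is an acceptable place for the output. The child command is a stand-in that prints about a megabyte, which is more than a typical pipe buffer holds.

```python
import subprocess
import sys
import tempfile

# Stand-in child that produces far more output than a pipe buffer can hold.
child = [sys.executable, "-c", "print('x' * 1000000)"]

with tempfile.TemporaryFile() as log:
    # The child writes straight to the file, so it cannot stall on a full
    # pipe even though the parent reads nothing while it runs.
    returncode = subprocess.call(child, stdout=log, stderr=log)
    log.seek(0)
    size = len(log.read())
```

The trade-off is that the output only becomes available through the file, so this suits "run it, then inspect the log" more than live streaming.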

For quick tasks and scripts, I've found that subprocess.check_call and subprocess.check_output, both with shell=True, are great tools for spawning subprocesses and quickly grabbing output. They're pretty straightforward to use.
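For anyone who hasn't used them, here is a small sketch of both calls, assuming a POSIX shell. The commands are throwaway examples.

```python
import subprocess

# check_output returns the command's stdout (bytes unless text=True is passed).
listing = subprocess.check_output("echo one && echo two", shell=True, text=True)

# check_call returns 0 on success...
status = subprocess.check_call("exit 0", shell=True)

# ...and raises CalledProcessError on a non-zero exit status.
try:
    subprocess.check_call("exit 3", shell=True)
    failed_code = None
except subprocess.CalledProcessError as err:
    failed_code = err.returncode
```

The exception on failure is the main difference from plain `subprocess.call`, which just hands back the exit status.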

I have never been able to figure out how, in Python, to asynchronously stream both stdout and stderr from a subprocess, printing them as they arrive while also writing the data to a file.
I'm using the mkfifo method on Linux/Mac OS X:

    import os
    import sys
    import time
    import subprocess

    pipename = "tempfile"

    if os.path.exists(pipename):
        os.remove(pipename)

    # create a named pipe. one side is connected to the ping process, other side is connected to python.
    os.mkfifo(pipename)
    # open the read side first, non-blocking, so that opening the write side doesn't hang.
    read_fd = os.open(pipename, os.O_RDONLY | os.O_NONBLOCK)
    write_fd = os.open(pipename, os.O_WRONLY)

    proc = subprocess.Popen("ping www.google.com", cwd=sys.path[0], stdout=write_fd, stderr=write_fd, shell=True)
    # the child has its own copy of the write side; closing ours lets read() see EOF when it exits.
    os.close(write_fd)

    while 1:
        try:
            # nonblocking poll data from the external process.
            s = os.read(read_fd, 1024)
            if not s:
                break  # EOF: every writer has closed the pipe
            # write raw bytes and flush right away. otherwise we won't see things like wget progress-bars that update without newlines.
            sys.stdout.buffer.write(s)
            sys.stdout.buffer.flush()
        except BlockingIOError:
            # no data right now. sidenote: minimum sleep time is 1/64 seconds on many windows pc-s.
            time.sleep(0.1)

    os.close(read_fd)
    os.remove(pipename)  # remove the pipe "tempfile"

Replying to myself. Using mkfifo is not necessary:

    import os, sys, time, subprocess, fcntl
    read_fd, write_fd = os.pipe()
    fcntl.fcntl(read_fd, fcntl.F_SETFL, os.O_NONBLOCK) # don't know of any windows equivalent for this line
    proc = subprocess.Popen("ping www.google.com", cwd=sys.path[0], stdout=write_fd, stderr=write_fd, shell=True)
    os.close(write_fd) # the child keeps its own copy; closing ours lets read() return EOF when it exits
    while 1:
        try:
            s = os.read(read_fd, 1024)
            if not s:
                break # EOF: the subprocess closed its end
            sys.stdout.buffer.write(s)
            sys.stdout.buffer.flush() # flush so partial lines show up right away
        except BlockingIOError:
            time.sleep(0.1) # nothing to read yet

You're listening for two file descriptor events, so you need some sort of event loop. select can do it but it's low-level; and since there can be only one event loop per program, your choices are frameworks and not simply libraries.

Here's a way to do it with Twisted (docs here: http://twistedmatrix.com/documents/current/core/howto/proces... ):

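  import sys
  from twisted.internet import reactor, protocol

  class PrintAndLogProtocol(protocol.ProcessProtocol):
      def outReceived(self, data):
          # print (data arrives as bytes); write it to your log file here as well
          sys.stdout.buffer.write(data)
          sys.stdout.buffer.flush()
      errReceived = outReceived

      def processEnded(self, reason):
          # stop the event loop once the child has exited
          reactor.stop()

  reactor.spawnProcess(PrintAndLogProtocol(),
       '/path/to/exe', ['exe', 'arg1', 'arg2'])
  reactor.run()
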
I've done this using select.
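
A select-based version might look roughly like the sketch below. This is not the code from the comment above, just one way to do it, and it assumes a Unix-like system where `select` works on pipes. `tee_process` and `log_path` are hypothetical names. It echoes each stream to the parent's matching file descriptor as data arrives and appends everything to one log file.

```python
import os
import select
import subprocess
import sys

def tee_process(args, log_path):
    """Run args, echo its stdout/stderr live, and append both to log_path.

    Hypothetical helper; returns the child's exit status.
    """
    with subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE) as proc:
        # Map each pipe to the parent's descriptor it should be echoed on
        # (1 is stdout, 2 is stderr).
        echo_to = {proc.stdout.fileno(): 1, proc.stderr.fileno(): 2}
        with open(log_path, "ab") as log:
            while echo_to:
                # Block until at least one pipe has data or has reached EOF.
                readable, _, _ = select.select(list(echo_to), [], [])
                for fd in readable:
                    chunk = os.read(fd, 4096)
                    if not chunk:
                        # EOF on this pipe: stop watching it.
                        del echo_to[fd]
                        continue
                    os.write(echo_to[fd], chunk)
                    log.write(chunk)
        # Leaving the with-block closes the pipes and waits for the child.
    return proc.returncode
```

Something like `tee_process(["ping", "-c", "3", "www.google.com"], "ping.log")` would be a typical call. Writing with `os.write` skips Python's own output buffering, which keeps progress bars that update without newlines visible.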