by matheusmoreira 1560 days ago
I also had higher expectations after reading the title and was disappointed when I realized it was about failure to handle all possible system call results. I thought it was gonna be a bug in the C standard library or something.

I still agree with the author though. This is a serious matter and it seems most of the time the vast amount of complexity that exists in seemingly simple functionality is ignored.

Hello world is not "simply" calling a text interface API. It is asking the operating system to write data somewhere. I/O is exactly where "simple" programs meet the real world where useful things happen and it's also where things often get ugly.

Here's all the stuff people need to think about in order to handle the many possible results of a single write system call on Linux:

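  /* Raw system call convention: the kernel returns -errno on failure.
     The libc write() wrapper returns -1 and sets errno instead. */
  long result = write(1, "Hello", sizeof("Hello") - 1);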

  switch (result) {
    case -EAGAIN:
      /* Occurs only if opened with O_NONBLOCK. EWOULDBLOCK has the same
         value as EAGAIN on Linux, so a separate case label for it would
         be a duplicate and fail to compile. */
      break;
    case -EBADF:
      /* File descriptor is invalid or wasn't opened for writing. */
      break;
    case -EDQUOT:
      /* User's disk quota reached. */
      break;
    case -EFAULT:
      /* Buffer points outside accessible address space. */
      break;
    case -EFBIG:
      /* Maximum file size reached. */
      break;
    case -EINTR:
      /* Interrupted by a signal before any data was written. */
      break;
    case -EINVAL:
      /* File descriptor unsuitable for writing. */
      break;
    case -EIO:
      /* General output error. */
      break;
    case -ENOSPC:
      /* No space available on device. */
      break;
    case -EPERM:
      /* File seal prevented the file from being written. */
      break;
    case -EPIPE:
      /* The pipe or socket being written to was closed. */
      break;
  }
Some of these are unlikely. Some of these are irrelevant. Some of these are very important. Virtually all of them seem to be routinely ignored, especially in text APIs.
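
For the text API case, here is a minimal sketch of a hello world that at least notices when the output did not land. The helper name say_hello is made up for illustration. It assumes ordinary buffered stdio, where fputs() can succeed by merely filling the buffer and the underlying write only happens at fflush().

```c
#include <stdio.h>

/* Hypothetical helper: prints the greeting and reports whether it actually
 * reached the underlying file. Returns 0 on success, -1 on failure. */
int say_hello(FILE *out)
{
    /* fputs() may only copy into the stdio buffer, so success here does
     * not prove that anything was written. */
    if (fputs("Hello\n", out) == EOF)
        return -1;

    /* fflush() pushes the buffer to the kernel, which is where errors
     * such as ENOSPC or EPIPE become visible. */
    if (fflush(out) == EOF)
        return -1;

    return 0;
}
```

A main() could call say_hello(stdout) and return a non-zero exit status on failure. Running such a program with stdout redirected to /dev/full should exercise the failure path on Linux.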

And specifically, "no space left on device" is a very common error that is also very commonly handled badly. It happened to me yesterday, and the error messages I got were unhelpful or non-existent. In Firefox, part of a website I was desperately trying to use just stopped responding for some functionality. Developer tools opened as a blank space. Importing a calendar entry in Evolution produced an inscrutable SQLite error. Starting Chromium (as a backup browser, in the hope that the website would work better there) via GNOME did not open any window or show any error. It was only when I tried to start Chromium from the console that I saw a helpful error message for the first time.

Also, I always start to mildly panic in such cases, as lots of software corrupts its on-disk state more badly when the disk is full than after any segfault, OOM kill or hard shutdown. I can understand and empathize with how this happens from a software development perspective, but objectively speaking, "our entire field is bad at what we do, and if you rely on us, everybody will die". ( https://xkcd.com/2030/ )

Userspace should not expect that any given syscall can only return some set of known errno values. You should enumerate the cases where you want to do some kind of special handling (with EINTR being somewhat more important than other cases) and have a path to somehow handle even unexpected errno values.
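
A rough sketch of that shape in C, using the libc convention of positive errno values. The enum and the function name classify_write_errno are invented for this example, and which values get special handling is an assumption that depends on the program.

```c
#include <errno.h>

/* Hypothetical set of reactions a caller might distinguish. */
enum io_action { IO_RETRY, IO_WAIT, IO_PEER_GONE, IO_FAIL };

/* Maps an errno value from a failed write to one of the reactions above. */
enum io_action classify_write_errno(int err)
{
    switch (err) {
    case EINTR:
        return IO_RETRY;       /* interrupted by a signal: call again */
    case EAGAIN:
#if EWOULDBLOCK != EAGAIN      /* same value on Linux, may differ elsewhere */
    case EWOULDBLOCK:
#endif
        return IO_WAIT;        /* non-blocking fd is not ready yet */
    case EPIPE:
        return IO_PEER_GONE;   /* reader closed the pipe or socket */
    default:
        /* Everything else, including values no manual page lists. */
        return IO_FAIL;
    }
}
```

The default branch is the important part: an errno value that nobody anticipated still ends up on a defined path instead of being silently dropped.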

Both the Linux man pages and SUS specify some set of possible error conditions, but not all of them. In the case of the man pages, the set is not fixed: it is subject to change and often omits the more obscure error states. The SUS "Errors" sections are explicitly not meant to be complete, and the OS can return additional errno values. The OS can even treat some error cases as undefined behavior and return no error code at all (notable example: doing anything to an already joined pthread_t on Linux, which is undefined and does not return ESRCH).

You're right. The manual contains this ominous notice at the very end of the errors section:

https://man7.org/linux/man-pages/man2/write.2.html

> Other errors may occur, depending on the object connected to fd.

I don't understand why every possible result isn't explicitly documented. This is the Linux system call interface; we need to know everything that could happen when we make these calls.

The right assumption is that every syscall can return any defined errno value. In practice this means that you should handle the cases that you have to handle (-EINTR and, for write(2), incomplete writes, which are a typical reason for “fatal error: Success”), handle the ones that you can somehow handle (things like retries for -ENOSPC), and log the strerror(3) result for anything that you don't expect (whether you should then abort(), exit() or continue depends on how critical the failed syscall was).
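
As a sketch of the mandatory part, here is a write loop through the libc wrapper (so -1 plus errno rather than -errno). The name write_all is made up, the fd is assumed to be blocking, and logging to stderr is just one possible choice. A short write is not an error and leaves errno untouched, which is how code that treats it as a failure ends up printing "Success".

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: writes all len bytes or reports why it could not.
 * Returns 0 on success, -1 on failure with errno describing the cause. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */

            int saved = errno;  /* fprintf() may overwrite errno */
            fprintf(stderr, "write: %s\n", strerror(saved));
            errno = saved;
            return -1;
        }

        /* Short write: skip what the kernel accepted and try the rest. */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

Whether the caller then retries, exits or aborts is left to it, since that depends on how critical the write was.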