by pkaler 4896 days ago
> This propensity for today’s working programs to be broken tomorrow is what I mean when I say these languages are not future proof.

It doesn't matter. This is not how programming works in the real world. In the real world, you write the most correct program you can under time pressure. A new compiler, operating system, or platform arrives that exposes a bug. You fix it and you move on. It doesn't matter if the language is future proof or not. The process is similar for any complex program.

The blog's name is "Embedded in Academia" and this is a perfectly valid viewpoint for someone in academia to take. And people in academia should do research towards building more robust tools and languages. But it really is not going to matter in the real world. Languages and platforms will never be future proof because computing is complex.

4 comments

The particular kind of not-future-proofness he has in mind seems pretty practically important: code that relies on this undefined behavior often suffers from exploitable security holes. Just because computing is complex doesn't mean you have a free pass if you shoot yourself (or your customers) in the foot the same way the previous 100 folks did. If it happens enough, it becomes prudent to do something about it, like people finally did about unsanitized format strings, or the use of unbounded sprintf().

His suggestion #3, that the standards should define more of the commonly used behavior and leave less of it undefined, wouldn't even require C programmers to do anything about it themselves.

> His suggestion #3, that the standards should define more of the commonly used behavior and leave less of it undefined, wouldn't even require C programmers to do anything about it themselves.

I've written Windows, Mac, Linux, Xbox, PlayStation, PSP, iOS, and Android code. The memory model is subtly different for each platform. I just don't think you can define certain behaviour and have that work across disparate platforms.

I haven't really written any device drivers or kernel space code but I would imagine it would make the job even more difficult.

You underestimate how much undefined behaviour is in typical C programs and how little of it is yet taken advantage of by compilers.

The compilers are now starting to fairly radically rewrite the original code in ways the author would not recognize, simply because some undefined behaviour exists within the code. You need to be increasingly language-lawyerly to avoid the compiler outsmarting you, almost as if it were a hostile opponent.

The read of an uninitialized variable in the article was a good example.

The problem is that programmers have a mental model of how the C they write turns into machine code, and that model is increasingly out of date as compilers chase more performance. The compiler is becoming less predictable, in precisely the way that we argued against "sufficiently smart compilers" in the past for languages at a higher level than C: you wouldn't be able to predict when the smart compiler was smart enough to optimize your high-level construct. Now you're increasingly unable to predict what the compiler will turn your code into, unless you have a deeper understanding of the rules.

The "hostile opponent" analogy is a good one. C was always intended as a kind of higher-level replacement for assembly, so it was reasonable to assume (for instance) that uninitialized integer variables contain some unspecified but definite value, but recently compilers have been deliberately breaking those assumptions just because they can. It's almost reached the point where C isn't useful for its original purpose of systems programming; it's very hard to write threaded code that doesn't rely on undefined behaviour, for instance.
Ostensibly, a platform like Java or Rust is supposed to abstract stuff like the memory model. I haven't written a lot of Java code, especially not Java code that runs on many different native system / VMs, but from my perspective of blissful ignorance, it seems to have done the job?

Same with other high-level VM based languages like Python...

For most programs, yes, they have done their job. However, in certain categories of applications (for example, server software) it's somewhat of a leaky abstraction. Garbage collector sweeps, circular references, etc. are all pain points that force you to be aware of how the VM is managing your memory.

For an impression, see this excellent blog series on how a certain garbage collector sweep issue was solved: http://blog.cloudera.com/blog/2011/02/avoiding-full-gcs-in-h...

It works, but to get C to expose the same memory model on different platforms you would have to compromise the performance and close-to-the-hardware nature that are the only reason to use C nowadays.
And of course, there's no guarantee that your JVM/Python/Ruby/other VM doesn't itself suffer from any of the quoted C/C++ issues.
Python is not future proof.

There are undefined sequences even in Python, where Jython and CPython produce different output for the same program.

Small amounts of undefined behaviour are normal in most language specs, though, to give implementations flexibility. Tests to make sure you do not rely on them would still be useful.
Not to mention the GIL...
The guy behind "Embedded in Academia" knows quite a bit about the memory models supported by C, and has done some marvelous work on the testing of C compilers, C code and undefined behavior. If he claims it is possible to improve the situation and leave less behavior undefined, he's most probably right.
You certainly could define some basic things to make the language safer. For example, make variables always be initialized to zero if not explicitly initialized, and force accessing beyond the bounds of an array to be a fault rather than undefined behavior.
You could, but that comes at a cost. That's why libraries like the STL in C++ provide std::vector::operator[] and std::vector::at() - so the user can freely choose whether to pay the extra cost for the bounds check, or not. That's why C provides both malloc() and calloc() - so the user can freely choose whether memory is zero-initialized, or not.

One of the major design decisions for C/C++ is that you don't pay for what you don't use. This is what makes them so flexible and performant across a wide range of systems and applications, but also leaves these safety choices up to the user. Some languages make that tradeoff, but it's not always the right decision.
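To make the "pay only for what you use" choice concrete, here is a rough C sketch of both options side by side. The helper names (`make_zeroed`, `make_filled`, `checked_get`) are made up for illustration and are not from any library; `checked_get` is only meant to play the role that `std::vector::at()` plays in C++.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: pay for zeroing up front.
   calloc returns zero-filled memory, or NULL on failure. */
int *make_zeroed(size_t n)
{
    return calloc(n, sizeof(int));
}

/* Hypothetical helper: skip the zeroing. The contents of malloc'd memory
   are indeterminate, so every element is written before anyone reads it. */
int *make_filled(size_t n, int value)
{
    if (n > SIZE_MAX / sizeof(int))
        return NULL;                 /* n * sizeof(int) would overflow */
    int *p = malloc(n * sizeof *p);
    if (p == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        p[i] = value;
    return p;
}

/* Hypothetical helper: bounds-checked read. Returns 1 and stores the
   element in *out on success, 0 if the index is out of range. */
int checked_get(const int *a, size_t len, size_t i, int *out)
{
    if (a == NULL || out == NULL || i >= len)
        return 0;
    *out = a[i];
    return 1;
}
```

A plain `a[i]` stays available for hot loops where the index is already known to be in range; the checked version costs one compare and branch per access.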

On the other hand there are languages where correctness comes before speed, and they still provide you the mechanisms to get speed if you really want.

For example, in the Pascal family of languages, you can always disable bounds checking or do pointer arithmetic if you really want to, but that should only be done if there is really the need to do so.

A problem with many C and C++ developers is that they suffer from premature optimization, thinking that we are still targeting PDP-11 like environments.

Initializing variables to zero doesn't buy you much in terms of safety, IMHO. The value 0 isn't necessarily any more valid than an arbitrary value. Better is Java/ML/Haskell's rule whereby variables must be explicitly initialized before use. This can be implemented with a simple compiler pass.
At least the value 0 is always the same and doesn't subtly change from one invocation to the next or from one machine to the next. It certainly helps in making programs more robust, even if there is still a problem at code level.
Java? Everything (except for built-in types) is nullable in Java...
pcwalton's point is that you must be explicit about initializing variables:

  int foo = 1;
  int bar;
  System.out.println(foo + bar);  // compile error: variable bar might not have been initialized
I agree that programmers should not take on the burden of supporting hypothetical future compiler optimizations (if that's what you're saying), but this problem could be reduced if compilers started forbidding undefined behavior — then programmers would only have to adapt once.
Much undefined behaviour can't be statically detected, unfortunately.
Is there a way to detect it dynamically, e.g. by running C code under a debug mode or in an interpreter that errors out when undefined behavior is encountered? I've occasionally wanted to have something like that to use in tests, so I could ensure that at least my common code paths aren't relying on undefined behavior. I know about gcc's -ftrapv and a few other options, but nothing comprehensive.
Besides the already mentioned:

- IOC : low overhead, only for integer overflows

- KCC : high overhead, for all kinds of undefined behavior, limited standard library support (and source-level only)

- Valgrind : medium overhead, for various memory errors, binary, may fail to detect undefined behaviors that have been made undetectable by compilation.

You may also find:

- various memory-safe C compilers. There are plenty here, I had better let you do the googling. medium overhead, generally better than Valgrind at being sound (since they work at source level), unless they trade efficiency for soundness: http://research.microsoft.com/pubs/101450/baggy-usenix2009.p... . May require all source code to be available.

- Frama-C's value analysis, a static analyzer that can be used as a C interpreter. This is what I work on. Limitations comparable to KCC, quite a bit faster (but still high overhead), some slightly different design choices. I do not have a good single write-up for this use, but some details are available at these URLs:

http://blog.frama-c.com/public/csmith.pdf

http://blog.frama-c.com/index.php?post/2011/08/29/CompCert-g...

I've heard of several:

http://embed.cs.utah.edu/ioc/

http://code.google.com/p/c-semantics/

Haven't used either in anger though.

Thanks! I'd run across the first one, but it's also only for the case of integer overflow. The 2nd is new to me, and looks quite comprehensive.
In theory, for sure. Valgrind can test for certain kinds of undefined behaviour - it runs the code in a special virtual machine.

You could also have the compiler insert checks. Obviously this isn't desirable for a lot of C projects by default, but (other than in places like kernel development etc.) it could be a nice debugging aid. I don't know of any good tools for doing this comprehensively.

But the author of the original piece is mostly concerned with undefined behavior that can be detected statically - otherwise, compilers would not be able to exploit it to make optimizations.
One thing he mentions is signed integer overflow. This is in the worst case equivalent to the halting problem, but even in practice very hard to test for at compile time.

Another behaviour he mentions is not properly return'ing at the end of a non-void function. This is again technically equivalent to the halting problem, but it is negated by the good practice of making every code path (even potentially dead ones) have a return statement (or throw an exception, etc.) Go takes this approach if I remember correctly.
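A tiny sketch of that practice in C, with a made-up function name (`sign_of`) purely for illustration. The point is only that the last line is an unconditional return, so no path can fall off the end:

```c
/* Hypothetical example: classify the sign of x.
   Every path ends in a return statement. */
int sign_of(int x)
{
    if (x > 0)
        return 1;
    if (x < 0)
        return -1;
    /* Without this final return, x == 0 would reach the closing brace
       of a non-void function, and a caller that used the result would
       be relying on undefined behaviour. */
    return 0;
}
```

GCC and Clang can flag the missing-return case with -Wreturn-type, which is cheap to turn on even if you don't want it to be a hard error.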

It can't always be tested for at compile time, but the problem he's complaining about is when C compilers do detect signed integer overflow. What happens is that someone writes code that in practice handles signed integer overflow fine, then a while later the C compiler developers get clever, detect the integer overflow, and decide to optimize that code away because it's invoking undefined behaviour and they can do whatever they like. The code in question is frequently security-critical, so by removing it the compiler converts safe code whose behaviour is technically undefined by the standard into a security vulnerability.
The common case is (probably) not that a compiler detects an instance of signed over/underflow. Instead, it can assume that this never happens and generate "dangerous" code.

A good post describing how these optimizations come about is http://www.airs.com/blog/archives/120

More options to warn about uses of or disable these optimizations would be welcome in compilers.
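For what it's worth, here is a minimal sketch of the usual way to stay out of trouble. The function names (`add_would_overflow`, `safe_add`) are made up for this example. The idea is to test against INT_MAX / INT_MIN before adding, so the overflowing addition is never evaluated and there is nothing for the compiler to assume away.

```c
#include <limits.h>
#include <stdbool.h>

/* The fragile idiom looks like this:
 *
 *     if (a + b < a) { handle overflow }    (with b > 0)
 *
 * If a + b overflows, the behaviour is already undefined, so a compiler
 * may assume it cannot happen and drop the test entirely. */

/* Hypothetical helper: true if a + b would not fit in an int.
   Uses only comparisons and subtractions that cannot themselves overflow. */
bool add_would_overflow(int a, int b)
{
    if (b > 0)
        return a > INT_MAX - b;
    if (b < 0)
        return a < INT_MIN - b;
    return false;
}

/* Hypothetical helper: stores a + b in *out and returns true,
   or returns false and leaves *out untouched if the sum would overflow. */
bool safe_add(int a, int b, int *out)
{
    if (add_would_overflow(a, b))
        return false;
    *out = a + b;
    return true;
}
```

If rewriting every check isn't practical, gcc's -fwrapv makes signed overflow wrap, at the cost of some of the optimizations discussed in the linked post.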

Not so. Assuming that a piece of code isn't doing anything undefined is a lot easier than detecting that it is.

That's generally how compilers take advantage.

For most large projects, you usually standardise the compilation environment for a specific release. Any issues for a newer version would be fixed when you make a newer release of your software. Especially for anything that is safety critical, like satellites or spaceships.
The software world does not solely consist of large, safety-critical projects.

Picture a single person or small team releasing an open-source project. It generates little developer interest, a community fails to form, and the original author(s) move on.

Fast forward 5 years or more. The code's floating around the internet, but nobody's left who understands it well enough to explain why it breaks with a modern toolchain. Requiring people to use a compiler -- and possibly an entire operating system -- of that age will deter people significantly from using that project.

> In the real world, you write the most correct program you can under time pressure. A new compiler, operating system, or platform arrives that exposes a bug. You fix it and you move on. It doesn't matter if the language is future proof or not.

A new compiler, OS or platform will require much less rewriting of a Python program than a C program. Under time pressure, it is much more likely that you will incidentally write future-proof code if you write in Python instead of C.