Hacker News
by barrkel 5570 days ago
So much of this is caused by unsigned types. They are evil; avoid them wherever you can.
3 comments

Care to elaborate? Unlike signed ones, unsigned integral types at least have well-defined behavior on shifting and overflow. (I'm speaking in terms of C specifically here, of course.)
Signed ints are easier to range check at runtime. Given an unsigned int, it's difficult to detect an invalid result from combining or comparing signed and unsigned ints.
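To make the mixed-comparison problem concrete, here is a minimal sketch in C. The function names (naive_less, safe_less, check_mixed_compare) are made up for illustration; the explicit cast in naive_less just spells out the conversion the compiler would otherwise apply implicitly.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors what the compiler does for `s < u`: the int is converted to
   unsigned first, so a negative value turns into a huge positive one. */
static bool naive_less(int s, unsigned u) {
    return (unsigned)s < u;
}

/* Range-checks the signed side before comparing. */
static bool safe_less(int s, unsigned u) {
    if (s < 0)
        return true;  /* every negative int is below every unsigned value */
    return (unsigned)s < u;
}

/* Hypothetical self-check showing where the two versions disagree. */
static void check_mixed_compare(void) {
    assert(!naive_less(-1, 1u));  /* -1 was converted to UINT_MAX */
    assert(safe_less(-1, 1u));
    assert(naive_less(3, 5u) && safe_less(3, 5u));
}
```

The two versions agree for nonnegative inputs and only differ once the signed operand goes below zero, which is exactly the case that is hard to spot after the fact.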

Google's C++ Style Guide discourages using unsigned ints to represent nonnegative numbers (like sizes or counts). It recommends using runtime checks or assertions instead.

http://google-styleguide.googlecode.com/svn/trunk/cppguide.x...

Unsigned ints make sense for bit twiddling, but you should probably use a fixed-size uint32_t or uint64_t to ensure the results are consistent across various architectures.
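As one example of the kind of bit twiddling where a fixed-size unsigned type helps, here is a small sketch of a 32-bit left rotate. The name rotl32 is an assumption for illustration, not something from the thread or from any particular library.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (0 <= n < 32).
   With uint32_t the width is the same on every platform, and shifting
   an unsigned value never touches a sign bit. */
static uint32_t rotl32(uint32_t x, unsigned n) {
    assert(n < 32u);
    /* The `& 31u` keeps the right shift in range when n is 0. */
    return (x << n) | (x >> ((32u - n) & 31u));
}
```

For instance, rotl32(0x80000001u, 1) moves the top bit around to the bottom and yields 0x00000003.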

The "always use signed" rule is a source of endless debate in C circles. I personally like almost everything in the Google C++ Style Guide, but this is one place where I think they got it wrong.

The problem is that the riskiest place for a signed/unsigned mismatch is when calling an unsigned API with a signed value. Simply deciding not to use unsigned at all doesn't fix this, because ANSI C and the STL use unsigned types throughout (e.g. memcpy).

  if (size <= 10) {
    // Yay, I have plenty of space
    memcpy(buffer, src, size);
  }
The code looks fine, but if "size" is an int with the value -1 there's a hard-to-spot bug: -1 passes the check and is then converted to a huge size_t in the call to memcpy. Plenty of security holes have been caused by just this sort of mistake. If you don't fight against the types that libc uses, you don't have this problem.
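A sketch of the two ways to close that hole, assuming a 10-byte destination buffer as in the snippet above. The helper names (copy_unsigned, copy_signed) and the BUF_SIZE constant are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { BUF_SIZE = 10 };

/* size_t version: a caller's -1 arrives here as SIZE_MAX, so a single
   upper-bound check rejects it. Returns 1 on success, 0 if rejected. */
static int copy_unsigned(char *buffer, const char *src, size_t size) {
    assert(buffer != NULL && src != NULL);
    if (size > BUF_SIZE)
        return 0;
    memcpy(buffer, src, size);
    return 1;
}

/* int version: both bounds must be checked before the value is
   converted to size_t for memcpy. */
static int copy_signed(char *buffer, const char *src, int size) {
    assert(buffer != NULL && src != NULL);
    if (size < 0 || size > BUF_SIZE)
        return 0;
    memcpy(buffer, src, (size_t)size);
    return 1;
}
```

Either version is safe; the difference is that the signed one depends on the programmer remembering the lower-bound check every time.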

There will still be spots where you'll need to compare signed and unsigned values, but the compiler will warn you about these. You'll have to cast one side or the other, but that's a GOOD thing. Since neither a signed compare nor an unsigned compare is always what you want, you want to be explicit about it.

There are other advantages to using unsigned types. For instance, it gives an explicit hint to the person reading the code about the range of the value. I think this makes interfaces clearer. For instance if you see a function signature of "void foo(const uint8_t *, size_t)" you'll immediately guess that you're dealing with a memory buffer and its explicit size without even seeing the names of the parameters.

Actually, if I had my way "int" would default to being unsigned and you'd have to specifically request "signed" if that's what you want. I find that I probably use unsigned types 5x as often as signed ones.

"There are other advantages to using unsigned types. For instance, it gives an explicit hint to the person reading the code about the range of the value."

This is, without doubt, the worst reason for using unsigned types, and it's the primary reason (IMHO) for the flaws in the C API that force you to use unsigned types unnecessarily. Unsigned types are not a documentation feature, and they are not merely an advert for an invariant; they are opting in to a subtly different arithmetic that most people are surprised by. It would be better to have range-checked types, like Pascal, than to infect the program with unsigned arithmetic.

I find that most programs deal with values for their integer types with an absolute value of under 1000; about the only excuse for using an unsigned type, IMO, is when you must have access to that highest bit in a defined way (for safe shifting and bit-twiddling).

> they are opting in to a subtly different arithmetic that most people are surprised by

I think that's a "citation needed" moment there. It's true that any native integer type will behave strangely if you go outside of its defined range. The only way to avoid that is to use a language that automatically converts to bignums behind the scenes (Common Lisp, etc).

What I don't agree with is that this is something that "most people are surprised by". If anything, the word "unsigned" is a pretty good hint about what behavior you'll get.

And even when you play fast-and-loose with the rules, it usually turns out ok:

   unsigned a, b, c, d;
   a = b + (c - d);
Even if d > c, this will do the expected thing on any two's complement architecture. Now, this would break if a and b were instead "unsigned long long". I think that case is fairly rare -- it's not a mistake I've seen commonly in real life (especially compared to the dangerous "botched range-check of a signed value" error).
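Here is a rough sketch of both cases, assuming unsigned is narrower than unsigned long long (true on common platforms, but not guaranteed by the standard). The function names are invented for illustration.

```c
#include <assert.h>

/* All-unsigned version: (c - d) wraps, and the addition wraps back. */
static unsigned sum32(unsigned b, unsigned c, unsigned d) {
    return b + (c - d);
}

/* Widened version: (c - d) wraps at the width of unsigned, and only then
   is it converted to unsigned long long, so the wrap is never undone. */
static unsigned long long sum64(unsigned long long b, unsigned c, unsigned d) {
    return b + (c - d);
}

/* One possible fix: widen before subtracting. */
static unsigned long long sum64_fixed(unsigned long long b, unsigned c, unsigned d) {
    return b + ((unsigned long long)c - d);
}

/* Hypothetical self-check with b = 10, c = 3, d = 5. */
static void check_sums(void) {
    assert(sum32(10u, 3u, 5u) == 8u);
    assert(sum64(10ull, 3u, 5u) != 8ull);       /* off by the width of unsigned */
    assert(sum64_fixed(10ull, 3u, 5u) == 8ull);
}
```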

But you are correct that it's not "merely an advert for an invariant" -- it's advertising that the compiler actually reads. It gives you better warnings (I've had plenty of bugs prevented by "comparison of signed and unsigned" warnings). It also allows the compiler to optimize better in some cases: compare the output of "foo % 16" with foo as signed and unsigned.
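The "foo % 16" comparison can be sketched like this. The function names are hypothetical, and what a given compiler actually emits will vary; the point is only that the unsigned form is equivalent to a bit mask, while the signed form has to preserve the sign of the dividend.

```c
#include <assert.h>

/* Signed: since C99 the result takes the sign of foo, so the compiler
   cannot simply mask the low four bits. */
static int mod16_signed(int foo) {
    return foo % 16;
}

/* Unsigned: always equal to foo & 15, which a compiler is free to emit
   as a single AND. */
static unsigned mod16_unsigned(unsigned foo) {
    return foo % 16u;
}

/* Hypothetical self-check. */
static void check_mod16(void) {
    assert(mod16_signed(-1) == -1);
    assert(mod16_signed(17) == 1);
    assert(mod16_unsigned(17u) == (17u & 15u));
    assert(mod16_unsigned((unsigned)-1) == 15u);
}
```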

> It would be better to have a range-checked types, like Pascal

Adding runtime checks to arithmetic is the type of cost that is never going to be in C. This is no different than saying "C should have garbage collection" or "C should have RTTI". They're perfectly valid things to want in a language, but they're anathema to the niche that C holds in the modern world. With C I want "a + b" to compile down to one instruction -- no surprises.

And even if you DID do a range-check, what do you do if it fails?
1. Throw an exception? Sounds logical... oh wait, this is C; there's no such thing as an exception.
2. Clamp the value? Now you have behavior that is just as bizarre as an integer overflow.
3. Crash? Not very friendly.
4. Have a user-definable callback (i.e. like a signal)? What is the chance that the programmer will be able to make a meaningful recovery, though?

There are, however, some additions to the C99 type system that I think would be useful; for example, C++11's strongly typed enums are a good idea.

> I find that most programs deal with values for their integer types with an absolute value of under 1000

I find that most programs deal with values greater-than-or-equal-to zero.

> I find that most programs deal with values greater-than-or-equal-to zero.

-1 is very frequently used as a sentinel value. For example, counting backwards through the elements of some container:

    for (i = count - 1; i >= 0; --i)
        /* body */;
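For contrast, if i in that loop were unsigned, the condition i >= 0 would always be true and the loop would never end. A sketch of a common workaround for unsigned indices, using a made-up reverse_copy helper:

```c
#include <assert.h>
#include <stddef.h>

/* Copy src into dst in reverse order, walking src from the last element
   down to the first with an unsigned index. */
static void reverse_copy(const int *src, int *dst, size_t count) {
    assert(src != NULL && dst != NULL);
    /* Test first, then decrement: the body sees count-1 ... 0, and the
       loop exits when i is 0 before the decrement. */
    for (size_t i = count; i-- > 0; )
        dst[count - 1 - i] = src[i];
}
```

It works, but it is noticeably less obvious than the signed loop, which is the point being made here.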
> I've had plenty of bugs prevented by "comparison of signed and unsigned" warnings

You wouldn't have had these warnings, much less needed to pay attention to them, if you hadn't had to use unsigned types in the first place.

This conversation is much like those around GC. It's impossible to convince people labouring under tyranny they've learned to love without them experiencing a free life first. You just can't communicate it with words.

Oh man, so I've been doing it wrong all the time. I always thought it would be a good idea to use the type system to its full capabilities and complained about the compiler for not adequately slapping my wrist when I obviously assign negative numbers to unsigned variables (a sign analysis is pretty simple to implement!).
Yes, you have been doing it wrong all this time, and I'm fairly confident in this. Using the type system to its full capabilities is not in itself a natural good. Sign analysis doesn't help you with the problem, because the problem is that using unsigned types means opting into a different arithmetic, an arithmetic which is unnatural to most humans.
> it's difficult to detect an invalid result from combining or comparing signed and unsigned ints

Isn't this why you should compile with all warnings on?

I'd wager that 90%+ of the time, people fix "comparison between signed and unsigned values" warnings by casting one side of the expression.

But if you really want to eliminate the potential for a bug from this warning, you have to go back through and tweak/check the values you're testing, all the way back to their source, fixing signedness along the way. At this point you may as well have settled on a default to begin with.

The real pain comes when you have to interface with external code. Even in the standard library, you'll find size_t (e.g. fread(3)) and ssize_t (e.g. read(2)). You're going to have a mismatch with one or the other.

I care less about C's specific behaviour on shifting and overflow (both of which are pretty rare), and more about the fact that unsigned integers use a different arithmetic to the signed integers most people are familiar with. In particular, subtraction doesn't mean what you think it does. At 0 in unsigned arithmetic, there's a gaping cliff you can fall off of where you wrap around the other side, while at 0 in signed arithmetic, you're well away from that cliff and are highly unlikely to get anywhere near to it. Writing a program using many unsigned numbers means playing on the edge of a cliff.
I wouldn't say they are evil. In fact, both signed and unsigned are the same--the only difference is the "pain point" (the place where you subtract 1 and your world breaks) is in a different spot. 0 for unsigned, INT_MIN for signed. Both are perfectly fine as long as you stay in their good range.
Yes - but 0 is much closer to the range most people put in their integer values than INT_MIN. The cliff you fall off is far closer with unsigned integers.
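The two "cliffs" can be shown side by side in a short sketch. The function names are hypothetical; the signed version asserts its precondition because decrementing INT_MIN is undefined behavior in C.

```c
#include <assert.h>
#include <limits.h>

/* Unsigned: well-defined, but 0 - 1 wraps around to UINT_MAX. */
static unsigned dec_unsigned(unsigned n) {
    return n - 1u;
}

/* Signed: 0 - 1 is just -1; the cliff is far away at INT_MIN. */
static int dec_signed(int n) {
    assert(n > INT_MIN);  /* INT_MIN - 1 would be undefined behavior */
    return n - 1;
}
```

So dec_unsigned(0) gives UINT_MAX while dec_signed(0) gives -1; both types have an edge, and the disagreement in the thread is about how often real values sit next to it.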
A coworker of mine was just bitten badly by Java's insistence that unsigned types are so evil that the language shouldn't have them. He calculated a 32-bit hash, but since the ints are all signed, he took the absolute value before the modulo with the hash table size. That's all well and good, but abs(-2147483648) is still -2147483648 in 32-bit two's complement arithmetic.
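For comparison, here is roughly how the same bucket computation might look in C with an unsigned type. In C, abs(INT_MIN) is undefined behavior rather than a negative result, so this sketch avoids abs altogether; bucket_for is a hypothetical name, not the coworker's code.

```c
#include <assert.h>
#include <stdint.h>

/* Map a signed 32-bit hash to a bucket index in [0, table_size).
   Converting to uint32_t is well-defined for every input, including
   INT32_MIN, so no absolute value is needed. */
static uint32_t bucket_for(int32_t hash, uint32_t table_size) {
    assert(table_size > 0);
    return (uint32_t)hash % table_size;
}
```

With this approach the most negative hash value simply becomes 2147483648 before the modulo, so the index can never come out negative.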

I'm sure I don't need to point out that this particular problem had nothing to do with unsigned types (they were signed!). A better rule of thumb is: never use "long" in C/C++ unless you really don't care whether it's 32 or 64 bits.