Hacker News new | ask | show | jobs
by twanvl 1392 days ago
Other advantages of zero based indexing, beyond being 'closer to the machine':

It works better with the modulo operator: `array[i%length]` vs `array[(i+length-1)%length+1]`. Or you would have to define a modulo-like operator that maps ℕ to [1..n].

It works better if you have a multi-dimensional index, for example the pixels in an image. With 0 based indexing, pixel `(x,y)` is at `array[x+widthy]`. With 1 based indexing it is at `array[x+width(y-1)]`. You might argue that programming languages should support multi-dimensional arrays, but you still need operations like resizing, views, etc.

6 comments

Another advantage is with ranges: 0-based indexing and exclusive ranges work well. This is apparent with cursor position in text selection

Consider:

    Characters       h e l l o
    Cursor index    0 1 2 3 4 5
    Char index       0 1 2 3 4
    Range [0,3)     [0,1,2]
    Range [2,5)         [2,3,4]
    Range [1,1)       []
If we used 1-based indexing and exclusive ranges, it leads to ranges where the end index is greater than the string's length...

    Characters       h e l l o
    Cursor index    0 1 2 3 4 5
    Char index       1 2 3 4 5
    Range [1,4)     [1,2,3]
    Range [3,6) (!)     [3,4,5]
    Range [2,2)       []
but if we use inclusive ranges, it leads to ranges where the end index is less than the start index...

    Characters       h e l l o
    Cursor index    0 1 2 3 4 5
    Char index       1 2 3 4 5
    Range [1,3]     [1,2,3]
    Range [3,5]         [3,4,5]
    Range [2,1) (!)   []
Also:

    Characters            h e l l o
    Cursor index         0 1 2 3 4 5
    0-based range [0,3)  [0,1,2]
    1-based range [1,4)  [1,2,3]
for the 0-based range [0, 3), the left array bracket is at cursor index 0, and the right bracket is at index 3. With 1-based indexing it doesn't work like that because the range is [1, 4)
This is the bane of my existence working with Lua.

Iterating an array or adding to the end are fine, we have ipairs and insert for that, but ranges on strings I'm constantly having to think harder and write more code than necessary.

I love the language, wouldn't trade it for another, but the 1-based indexing on strings, which represents an empty string at position 3 as (3,2), it's egregious.

Not as egregious as a dynamic language where 0 is false though.

> a dynamic language where 0 is false though.

well, C also considers 0 being false (and you can argue that C is "dynamic"!).

Yeah, offsets are just easier to mathematically manipulate than ordinals. It's not just pointer arithmetic where it matters that item i corresponds to start+i*step. Any time you want to convert between integer indices and general linearly-spaced values, 0-based indexing is more convenient.
In many domains code maintenance is more important than hardware costs. In many domains 1-based indexing is a better fit, meaning less conversion code, meaning simpler code. Thus, the best indexing choice depends on the domain and circumstances, as do many controversial questions. Most tend to specialize in specific kinds of domains and over-extrapolate their experience into other domains.
I'd agree so why did languages that allow specifying the base die (other than maybe VBA).
Because changing the base is a great source of bugs.

That said, not all languages have given up on this. For example Julia allows it. https://docs.julialang.org/en/v1/devdocs/offset-arrays/

Ironically I learned this from a discussion of Julia bugs. Apparently changing offsetting of arrays has proven to be a source of bugs in Julia. So maybe someday they will come to the same conclusion as languages like Perl and stop allowing it.

I'd argue 0-base is a source of bugs too! Ideally we'd be able to catch more array indexing bugs at compile time - there are definitely cases where it should be possible to determine that arrays are being incorrectly indexed via static analysis.
The problem is that libraries which assume 0-base break when you have a 1-based array. And vice versa. Trying to combine libraries with different conventions becomes impossible.

Therefore changing the base leads to more bugs than either base alone.

That said, the more you can just use a foreach to not worry about the index at all, the better.

Of 0-based and 1-based, the only data point I have is a side comment of Dijkstra's that the language Mesa allowed both, and found that 0-based arrays lead to the fewest bugs in practice. I'd love better data on that, but this is a good reason to prefer 0-based.

That said, I can work with either. But Python uses 0-based and plpgsql uses 1-based. Switching back and forth gets..annoying.

I'd expect the compiler not to let you to pass a 0-based array to a library function expecting a 1-based array. I'm pretty sure that's how it worked with Visual Basic, which was the only language I ever used such a feature in.
A disadvantage comes to mind. While the following loop works as expected:

    for (size_t i = 0; i < length; i++) ...
The following causes an unsigned integer underflow and is an infinite loop:

    for (size_t i = length - 1; i >= 0; i--) ...
As Jens Gustedt points out[1], the following intentional unsigned overflow works perfectly for downwards iteration (even when length is 0 or SIZE_MAX), though it looks a bit confusing at first:

  for (size_t i = length - 1; i < length; i--) ...
You are also free to start at any other (not necessarily in-bounds) index, just like with ascending iteration.

[1] https://gustedt.wordpress.com/2013/07/15/a-praise-of-size_t-...

Principle of least surprise violated.

Also, that behavior is not guaranteed. The programmer would need to be aware of how the particular machine in question actually handles that.

Then again, that's C.

Unsigned integer underflow and overflow are both guaranteed to wrap by the C standard.
Uh I'm confused but don't know c++.

why doesn't that loop end instantly?

I mean length - 1 < length should always be true, right?

Or does it only terminate when the number underflows? Terribly confused here

For loops are translatable from:

  for(initialize; condition; increment) { ... }
to:

  initialize;
  while(condition) {
    ...
    increment
  }
(more or less, some scoping things not encompassed by the above; this is also how pretty much every for loop in a C-syntax language works) The condition of a for loop is equivalent to a while loop's condition. So yes, length - 1 < length will be true on the first iteration, which is fine because the loop continues as long as that condition is true.

What the above approach takes advantage of is that when underflow eventually happens you'll have this condition:

  MAXINT < length
Which will terminate it for all possible values of length.
It‘s an unsigned int, so past 0 it overflows back to the maximum
Ooh, i see. I wasn't aware that they under/overflow at 0 when they're unsigned. Thanks for broadening my horizon!
It loops while the condition is true. When an underflow happens, it stops being true.
It’s a condition to run, not a condition to stop
That was my reaction - why should anyone think it might be the latter? Are there languages that do have such a syntax without explicit keywords ("do...until")?
I think lisp or scheme does. I was often confused by that when I was playing with it
The loop continues until i transitions from 0 to 0 minus 1. 0-1 in this case actually doesn't equal -1 since size_t is an unsigned type, instead it wraps around to be the largest possible positive integer instead. TLDR; yes as you speculate it terminates when the number underflows.
the footnote [1] should be [0], just for the sake of this very topic.

Seriously though, while the idiom does work for unsigned integers, it's a bad idiom to learn [makes code reviews harder]. The post-decrement one in the loop body works with everything (signed/unsigned), and it's well known.

In that specific case I'd do the following:

    for (size_t i = n; i-- > 0 ;) ...
Or count from `length` to 1, but subtract 1 in the loop body, or count up and subtract the length in the loop body. Any modern compiler should be able to optimise these to be equivalent.

In the majority of cases, counting down is not necessarily. Nor is ordered iteration. Most languages have a `for each` style syntax that's preferable anyway.

Ah yes, the goes-to operator -->
And the wink operator, so the compiler knows you know the deal
I like how GP is called "operator-name" and instead of doing himself, he makes others joking with operator names. Although I'm not sure if it's altruism or highly manipulative behaviour.
Gold - have an upvote!
Or alternatively:

  size_t i = length;
  while (i--) ...
Unless I am remembering C wrong, this would work but the

   for (size_t i = length; i-- > 0 ; ) ...
that several other people posted would not execute for index 0. Shouldn't it be this instead?

   for (size_t i = length; --i > 0 ; ) ...
the correct way/idiom to reverse iterate an array is

  for (size_t i = length; i-- > 0; )...
It's surprising how often the issue pops, it works well with both signed and unsigned integers.

(edit) I've started with one based indexing (basic)... mixed with 0 based (assembly), more 1 based (pascal), then more stuff (all zero based). I am, yet, to see a real advantage of a one based indexing... after the initial process.

Is this cache friendly?
Sure, why wouldn't it be? As far as a cache is concerned, I don't think reverse sequential iteration would be any different than forward sequential. The actual RAM accesses may be less optimal if there's some speculative pre-fetching with assumed forward sequential access, but that's conjecture.
With some exceptions, hardware prefetch works in terms of ascending accesses. To learn if a particular CPU will prefetch for descending access, benchmarking is essential. Best to use soft prefetch calls if performance is critical.
i would suspect that the cache prefetch/prediction could use the "velocity" of the memory access to predict the next access; so if the access pattern was going backwards, the "velocity" would be negative, but prefetching would still work if they just followed the predicted pattern.
It is not, unless your compiler is smart enough to recognize reverse iteration and prefetch appropriately.

If performance matters, you should experiment with __builtin_prefetch, which is available in clang and GCC.

It’s not. It was nice on architectures were cache didn’t matter much and were subtracting and comparing to zero was just one instruction (looking at you old core ARM)
In the C programming language unsigned integers do not overflow. They wrap. This is well-defined behaviour and the example code is simply incorrect. Most modern compilers will give you a diagnostic for this.
If you have foolishly turned off -W in your build system, that could happen. Otherwise, you get a nice warning pointing out your folly.
Unless wrapping underflow is sensible for the domain (which it isn’t when representing the size of something), unsigned integers are usually a bad idea.
You can always rip a page out of C++’s playbook:

    for (size_t i = length; i > 0; i--) {
        // ...
        item = array[i - 1];
(This is how reverse iterators work in C++.)

  for (size_t i = 0; i < length; i++) {
      size_t j = (length - 1) - i;
      ...
  }
EDIT: change i to j
That's quite broken (you'd want a different variable inside the body, vs clobbering the iteration counter, else this would process the last item in your list, then exit).
I hope you haven't done this anywhere.

Makes my brain hurt, but I think this will only run through the loop one time looking at the last element of the array.

Uh no don't reassign the loop variable in the inner scope. Use:

    const j = (length - 1) - i;
in that case. Much safer.
You should save the old value of i somewhere and restore it back at the very end of the loop. Or simply define a new j like another comment says.
A rare case where 1-based indexing is more convenient is complete binary trees laid out breadth-first (as in a standard binary heap): parent is i div 2 and children are 2i and 2i+1 when starting at one and who knows what when starting at zero. But that’s the only one I know.
Except 1-based indexing is what we use in normal language. We don't use "zeroeth" or "player (number) zero" etc. And the word "first" is shortened to 1st etc. Personally I think we'd be better off if programming languages stuck to the same convention - off-by-1 errors aren't the hardest problems to deal with but they're still annoying.
Sure, it's technically a word, but hardly one you'd casually drop into your conversations (with non-programmers)
> "player (number) zero"

That would be because any game worth playing has at least one player... and so it's natural to continue from there. (In terms of language.)

You might have heard of Conway's Game of Life.
Fair :)
That's confusing ordinal and cardinal numbers. The element with index 0 is the first number. The element with index 1 is the second number, and so on.

Using the term "zeroth" is basically some form of showing off (even though it's kinda fun), but will be utterly confusing when you get to the fifty-second element which is the last in a group of 53 elements.

I'm not confusing them, my point about abbreviating "first" as 1st was that in typical speech we start counting at 1. Nobody says "let's start with item zero on the list". But programmers are stuck with having to say/think "item 0 in the array".
I don't disagree with you. I just don't think a programmer should be confused about the statement "item 0 is the 1'st item".
The 2 major blunders in programming: 1) Off by one errors.
With 0-based indexing the children are at 2i+1 and 2i+2. The parent is at (i-1) div 2.

Not hard to figure out.

> With 0-based indexing the children are at 2i+1 and 2i+2. The parent is at (i-1) div 2.

> Not hard to figure out.

While that's true, "you just shift by 1" is equally good at all arguments for or against 0-based indexing, so deploying it here probably won't convince.

I was not trying to convince.

That said, the effort of one versus the other is so trivial that there is no point in ever using effort as an argument either way. Doubly so because what seems like effort to us is simple unfamiliarity.

What is important is which one leads to more careless errors in practice. As a trivial example, consistent indentation takes effort, but failing to do it leads to more careless errors. Therefore everyone indents code.

The only data point I've seen on that is the side remark about Mesa in https://www.cs.utexas.edu/users/EWD/transcriptions/EWD08xx/E.... That remark, therefore, is the only argument that I care about.

An incredibly unrare case happens all the time in my work: array[length - 1], or variations of this. Anything involving the last element of the array, and often iterating through the whole array will use something similar at some point.
This is the reason I always think of when I hear the question
I think the 1-indexing folks would have to argue that a%b should return a value from 1 to b inclusive. This does make the same sort of intuitive sense as 1-indexing. For example we number clocks from 1 to 12.
% is defined as (mostly) a remainder operator. I don't think you want to change the semantics of division itself.
Right. Ultimately the 1-indexers are wrong.
But interestingly, the number at the top is 12, not 1.
No we don't. Clocks start at 0. what time is it when it's half an hour after midnight? 00:30.
but nobody "says" zero o'clock - it's always twelve o'clock!

and the example is in 24hr format - which needs to have the 00 to differentiate it from being 12:30. But if you write in 12hr format, you don't ever use 00 - it's always 12:30am or 12:30pm

First of all, you don't. Or I guess most Americans don't. I do, or most Europeans do. 12.30am and 12.30pm feels just very wrong in Europe.

But yea, I do agree with you. we don't say 0, we say 12, no matter if it's noon or midnight. That's because we humans avoid saying zero when we mean zero.

It's the same with other comments here, talking about counding seconds, we say "ok go, one, two,.. and not "zero, one, two,..". We say 3 months old baby, and not 0 years old baby.

We use 0-index in so many things, we just avoid saying the word "zero", and we use other names or other units to avoid that word.

I'm from Sweden, and I certainly would say 00:15 (as zero fifteen). Although the time at exactly 00:00 would be called midnight.
12 is zero, in 12-hour clocks.