Made of Bugs

Indices Point Between Elements

If you're familiar with nearly any mainstream programming language, and I asked you to draw a diagram of an array, the array indices, and the array elements, odds are good you'd produce a diagram something like this:

In this post, I want to persuade you to replace that image, or, at least, to augment it with an alternate view on the world.

I want to argue that, rather than numbering elements of an array, it makes just as much sense, and in many cases more, to number the spaces between elements:

With this representation, we do have to relearn how to refer to indexing: We refer to A[i] no longer as "The element at index i", but rather "The element to the right of index i".

Representing Ranges

I'll run through a few reasons why I prefer this representation, but most of them boil down to representing ranges of elements.

Suppose we have an array, and we want a way to refer to a certain subset of it, like so:

One obvious answer is with a start and a length: start=2 length=3, but it's often convenient to represent a range as a pair of (start, end) indices. The latter representation, for example, lets you check if an index falls into the range directly.

If we number elements, it's not immediately apparent which index to use for end:

Both (1, 3) and (1, 4) seem initially defensible. But if we number between elements, there's a clear, unique answer:

The indices we want are the ones that lie between the included and excluded elements: (1, 4).

With this model, the rules of range manipulation and comparison become straightforward:

  • Two ranges are adjacent if left.end == right.start

  • One range is a subset of another if inner.start >= outer.start && inner.end <= outer.end

  • A range contains end - start elements:

In order to answer the question "if I dereference an index, is the result contained in the range?", we need to remember that A[i] is now defined as the element after index i. With that in mind, it becomes easy to see that for a range (start, end), elements indexed by start <= i < end are within the range.

Off-by-one errors

Indexing between elements, instead of indexing elements, helps avoid a large class of off-by-one errors. I'll run through a number of examples using Python, but the fundamental issues apply to array APIs in many more languages, or to any time you're manipulating an array-like data structure via any of these operations.

Inserting elements

If you want to insert an element into an array, how do you specify the location? If you name an existing element, does the new element go before or after that element?

Python's standard library documentation somewhat awkwardly specifies that "The first argument is the index of the element before which to insert," clarifying that this means insert(0, X) inserts X at the start of the array.

But if we number gaps between elements, instead of numbering elements, the story is perfectly clear: 0 names the gap before the first element, and so of course inserting at 0 should prepend an element. Similarly, 1 names the gap between the first and second element, and all the way on.

Slicing arrays

How do we refer to a partial subset of an array that we want to extract? Python, like many other languages, lets you use a pair of indexes:

>>> [1,2,3,4][1:3]
[2, 3]

The documentation, however, has to resolve the same ambiguity noted above: Is the final index excluded or included? Ruby even helpfully offers you both choices:

irb(main)> [1,2,3,4][1...3]
=> [2, 3]
irb(main)> [1,2,3,4][1..3]
=> [2, 3, 4]

As discussed earlier, if we adjust our view of indexes, there is no ambiguity at all. Conveniently, this also gives us the same semantics as Python and most other languages: there are good reasons half-inclusive ranges are generally preferable, and most languages converge on this choice.

Removing elements

If we want to remove a single element from an array, it does seem simpler to index elements directly – we can just name directly the index which we want to eliminate.

However, if we want to adopt the more general primitive, of removing slices, (Python's del array[x:y]), we run into the same problem as extracting slices, previously. Once again, shifting our thinking to index between elements removes all ambiguity.

Incrementally consuming an array

Suppose we're walking through an array, consuming it by elements or groups of elements at a time. Perhaps we're parsing a string, consuming tokens as we go.

How do we keep track of our current position? Should we keep the index of the last element we've processed, or of the first element we have yet to process?

If we shift our perspective, this problem too vanishes: We can store the index between the last item consumed, and the next one to be consumed. Our index neatly partitions the buffer into "processed" and "to-be-processed", with no ambiguity at all.

C/C++ Pointers and Iterators

With pointers in C, or with iterators in C++ (which were essentially designed to mimic C's pointer semantics), we speak of pointers into an array or of iterators as referring to a specific element in memory.

However, both systems allow for this additional "valid" iterator or pointer, which points "just past the end" of a container. This pointer/iterator does not name a valid element, but is a valid pointer or iterator. The C specification is full of awkward verbiage to address this special-case:

both [pointers] shall point to elements of the same array object, or one past the last element of the array object;

(N1256 ยง6.5.6p9). And with a C++ std::vector, v.begin() and v.end() are both valid iterators, but v.end() points "one past the end" and cannot be dereferenced.

These apparent odd inconsistencies and special cases vanish if you shift your thinking slightly in just the way I've been arguing: Instead of thinking of iterators as referring to individual elements, we hold that they name the interstitial points between elements.

If we do so, the "one-past-the-end" iterator is no longer "past" the end – it points directly at the end, which is no more fundamentally special than the "start" iterator which points directly at the beginning.

It's still the case that we cannot dereference v.end(), but that behavior is a function of the "dereference" operation, which selects the element after an iterator. The iterators themselves are no longer special cases.

Postscript: 0 or 1?

It used to be popular, and still is in some circles, to debate whether programming languages ought start array indexing at 0 or 1. And there are still a few holdouts, like Matlab, which number their arrays starting from 1, causing no end of pain and confusion to those poor souls who have to switch between them and more mainstream languages.

Once I started thinking of pointers or iterators or indexes as indexing between elements, one of the more mind-bending realizations that followed was that this model can harmonize the "index from 0" and "index from 1" camps!

Let's consider an array with interstices labeled again:

The first element, A[0] in C or Python, is bracketed by indexes 0 and 1. The decision, then to name it as 1, does not involve changing the picture at all; If you draw indices between elements, the statement "Arrays start at 1" is simply a decision that "The deference operator refers to the element to the left of an index," in exactly the same way that I described dereference in a 0-indexed language as taking the element to the right. And presented that way – "should dereference take the left or the right element?" – it becomes clear that 0 or 1 really is an arbitrary choice.

Acknowledgements

I credit my personal shift in thinking – from labeling elements to labeling interstices – to my reading of the excellent The Craft Of Text Editing, which introduces the concept both for its notion of a mark, and specifically as an implementation idea while talking about buffer representations.

I recommend giving the book a read even if you never aspire personally to implement an Emacs; It's filled with a great number of interesting ideas, and possessed of a profound clarity of thought throughout.