Hacker News new | ask | show | jobs
by jgrahamc 3477 days ago
This is of course not a new invention. The earliest instance I could find with a bit of searching was from 2004, with Andrew Morton mentioning in it a code review so casually that it seems to have been a well established trick. But the vast majority of implementations I looked at do not do this.

I was doing this in 1992 so it's at least 12 years older than the 2004 implementation. I suspect it was being done long before that. Back then the read and write indexes were being updated by separate processors (even more fun, processors with different endianness) with no locking. The only assumption being made was that updates to the read/write pointers were atomic (in this case 'atomic' meant that the two bytes that made up a word, counters were 16 bits, were written in atomically). Comically, on one piece of hardware this was not the case and I spent many hours inside the old Apollo works outside Boston with an ICE and a bunch of logic analyzers figuring out what the hell was happening on some weird EISA bus add on to some HP workstation.

It's unclear to me why the focus on a 2^n sized buffer just so you can use & for the mask.

Edit: having had this discussion I've realized that Juho's implementation is different from the 1992 implementation I was using because he doesn't ever reset the read/write indexes. Oops.

7 comments

I was doing this in 1992 so it's at least 12 years older than the 2004 implementation

Try late 1960s. Generally known then, widely used.

For an interesting proof about tokens in ring buffers, check out https://www.cs.utexas.edu/users/EWD/ewd04xx/EWD426.PDF, which, for 1974, has an interesting bit of multiprocessing.

> It's unclear to me why the focus on a 2^n sized buffer just so you can use & for the mask.

The cost of a mask can probably be entirely buried in the instruction pipeline, so that it's hardly any more expensive than whatever it costs just to move from one register to another.

Modulo requires division. Division requires a hardware algorithm that iterates, consuming multiple cycles (pipeline stall).

You do not need modulo or division to implement non-power-of-2 ring buffers. Because you will only increment by one. So instead of "x = x % BufferSize" you can do "if (x >= BufferSize) x -= BufferSize;" or similar.

That's for "normal" ring buffers. I suspect that the design described in the article can be implemented for non power-of-two without division but I'll need to think about the details.

You could also do a predicated conditional move instead. Just do the subtraction every time, and use something like cmov to only do the write if you need to.

I don't know if it would end up being faster, though.

Most likely (very very likely) the branch would be faster. It will almost always be predicted correctly (exceptions on the rollover) and cmov can be moderately expensive.

The general rule is to only use cmov if your test condition is mostly random.

I don't think this is true. I tried the sample out, and gcc, clang, and intel's compiler all generate a cmov for the code instead of a branch with -O2. I don't think all these compilers would have used a cmov instead of a branch if the cmov was more expensive than a branch in this case.

https://godbolt.org/g/nyFLwp

It might have something to do with the branch not being part of a loop so the best it can do is assume the branch is random (think something like modding a hash code when it would indeed be random).

Ad was pointed out in the thread, recent cpus have reduced the latency of cmov to a cycle. So your result could also depend on your architecture.

If you use __builtin_expect() the Intel compiler uses a branch. I mean this way:

if (__builtin_expect(index >= cap,0)) { ...

Can't the CPU just continue execute out-of-order while waiting on the cmov data dependency to finish though?

In which case cmov would be relatively cheap since it isn't blocking execution of other instructions.

It does, but it may run out of non-dependent instructions to execute while waiting for the long pole of the cmov dependency chain to finish.

This used to be a problem on Pentium4 where cmov had high latency (4 cycles or more), but today, IIRC, a register-to-register cmov is only one cycle, so it is safe to use whenever a branch could have a non-trivial misprediction rate.

Then you have to deal with branch mispredictions which may hurt performance pretty bad if the RB is heavily trafficked (which often is the use case for an RB).
Actually it might not really be mispredicted. Slightly older Intel CPUs had a dedicated loop predictor that would exactly predict quite complicated taken/non-taken sequences. If the RB is of fixed size the edge case would be always perfectly predicted.

More recent CPUs, IIRC, do away with the dedicated loop predictor as they have a much more sophisticated general predictor, which, although won't guarantee perfect prediction on this case, it might still get close enough.

Depends heavily on the size of the buffer. If it's only three elements large, then branch overhead will be measurable. But for larger buffers, it likely will always be predicted as not taken, and you only have a branch miss upon wraparound.
Modulo by a constant doesn't require a division, you can instead use multiplications, shifts, adds and subtracts. This transform is typical for compilers. For example, this is what gcc targetting x86_64 does to perform % 17 on the unsigned value in %edi:

  movl	$-252645135, %edx
  movl	%edi, %eax
  mull	%edx
  movl	%edx, %eax
  shrl	$4, %eax
  movl	%eax, %edx
  sall	$4, %edx
  addl	%edx, %eax
  subl	%eax, %edi
It doesn't require division, but how is that long sequence of instructions going to beat an AND?
That only works for compile time constants, though. With a power of two sized buffer you can just store the mask and decide how large you want your buffer to be at runtime.
libdivide (http://libdivide.com) implements similar logic at runtime
You can also compute the optimization at runtime, same as the compiler does. Libtheora contains code for this, used for quantizing video data.
The real problem with modulo is at the end of the integer range. Add one and overflow and suddenly jump to a totally different index in the array!

BTW: read and write pointers, power of two, did that in BeOS (1998) and many sound drivers did it earlier than that. To me, that seemed like the obvious way to do it when I needed it.

It's not just an optimization, it's necessary for correct operation. With a non-power of two buffer the integer wraparound causes a discontinuity.
As far as I can tell that int wrap around could be avoided by

- subtracting buffer size from both pointers once the read pointer has wrapped.

- choosing a longer int for the math operation where possible

That seems a small price for the freedom to be able to choose an appropriate buffer size.

One of the benefits of the original algorithm is the independence of the read and write indexes, they can be updated from different threads (or different processors!) without any atomic operations beyond writing or reading a value. Subtracting from both pointers requires an additional atomic read/modify/write operation.
You could also just restrict the pointers in the normal way but to two times the size of the buffer. So instead of wrapping at N you wrap at 2*N.

You are only encoding 1 bit of data (first or second) so adding more data than that by allowing unsigned integer overflow is just an optimization, not fundamentally necessary.

If you do that then the size() function becomes a problem. The original implementation relies on unsigned integer wrap-around to give the proper result when write < read.
That is fairly easy to fix, though. Add N to the size value you get until its non-negative.
It all depends on your use case. In my experience, most of the times when dealing with RB:s, it's more important to guarantee that the read index and the write index can be updated lock-free by different threads (e.g. one producer and one consumer) than to have a very specific non-power-of-2 capacity.
No, it's an optimization.
No, it's not. A non-power-of-two will lead to incorrect behavior, implementing the data structure the way the author describes, for precisely the reason given by your parent comment. There are other implementations that don't suffer from this (described in the comments there), but it's not simply a matter of replacing bitwise-and with a more generic computation of the modulus.

Imagine we had four bit integers and a three cell array. Stepping from 7 (= 1 mod 3) to 8 (= 2 mod 3) winds up stepping instead to 0 (= 0 mod 3) because overflow, which would reuse cells inappropriately.

I never claimed you should use a modulus operation.
You said,

> It's unclear to me why the focus on a 2^n sized buffer just so you can use & for the mask.

In fact it is required for correctness to use the approach he specified with any choice of masking operation.

Any choice?

He only ever calls mask() after an increment. There are other ways of detecting the overflow and wrap to 0 that don't involve modulus.

Array size of 5 produces a mask of 0100

Write index of 8 = 1000

Masking those together to create a write position 1000 & 0100 = 0

Doesn't work out correctly, got 0, would expect to get 2. In fact, you could never get a write position of 1, 2, or 3 with an array size of 5.

This is not the problem.

You are assuming still using &, which is very obviously incorrect with a very obvious fix (%5) which is still wrong because of behavior at overflow.

I'm answering the specific problem posed by the OP, given his other comments in this thread.

[EDIT] Resolved internal concerns about size calculations.

You might have misread the OP. He's not asking why using & for the mask requires a 2^n sized buffer. He's asking why bother using & when it imposes these additional constraints on us. A part of the answer is that the constraints are already there with this approach, even if you pick a different function for masking than f(i,n) = i & (n - 1) -- which point OP only recently understood due to a misreading of the article.
This is not only an optimisation. Power of 2 is necessary to avoid discontinuity. See the comments below the article for explanation.
Depending on the machine, a modulo can cost _alot_ .

Last time I checked, the operation cost was 26k cycles on my PIC.

Using a 2^n + mask made my queue perform 10 times faster (if not more).

Don't use a modulo then. Use subtraction.
Subtraction requires a branch, which could be worse (or not) depending on architecture.
Oh, come on, you don't need a branch to do a conditional subtraction. Reify the condition to 0/1 and use multiplication, or use AND with a two's complement of the condition.
OK, my bit twiddling knowledge is weak, my google skills are weaker still, and now I'm curious: what does "AND with a two's complement of the condition" mean, exactly?
x -= N & -(x >= N);
Thanks! This AND trick is great. Actually makes me want to go back to assembly-level programming.
Good point!
The branch will be properly predicted every time except for when it wraps. This should be faster than any of the alternatives.
I fully believe that there are plenty of contexts where that's true - particularly in any throughput oriented system where the buffer is large. But if you care, measure.
For what it's worth, Seymour Cray's Control Data 6600 implemented ring buffers for I/O channels correctly in hardware using gumdrop-sized single transistors in 1963. This is not exactly a new technology.
Me too, back in 2005 we were already doing all this. Also there is an easy hack that solves the overflow part.
It's odd that you were using ring buffers in 1992 for low level code but don't understand the value of avoiding a modulus instruction. Masking is far more efficient and often a ring buffer will be used in code where performance is absolutely critical.
You wouldn't use the modulus operation. You aren't adding some arbitrary number that's going to make you increase either index by more than the buffer length so you know that at worse you are going to need to subtract the length of the buffer.

IIRC the way we made this really fast was the write the buffer backwards. That way you can detect wrapping around the buffer because DEC will underflow and set the sign flag. Then you can JS to whatever code needs to ADD back the buffer length to handle the wrap around.

But 2^n has another problem (back in that era): buffer size. You are stuck with 1K, 2K, 4K, etc. buffers. When memory is tight you likely need something very specific, so you end up with the solution we had.

But, hey, if memory is free use 2^n bytes for your buffer.

You are still introducing a conditional by detecting the need to subtract, and iterating backward through memory is horrific for cache performance. If you need a specific, non power-of-2 sized buffer, then of course you make that design decision and pay the performance penalty. But I restate it's odd that you weren't even aware of the cost in 1992 as a system level programmer.
>iterating backward through memory is horrific for cache performance

This isn't true for Intel chips since Netburst Pentium 4. The hardware prefetchers can handle predicting iterating through an array forwards, backwards, and even strided accesses [0]. The arrays takes up the same number of cache lines in both cases, so going forwards or backwards are still going to have the same number of cache misses.

0: https://software.intel.com/en-us/articles/optimizing-applica...

You don't need a conditional. You can set up a mask using sbb.

                  ; precondition: 0 <= x <= N
                  ; (N is constant)
    mov y, 0      ; set up mask
    cmp x, N-1    ; set carry flag if x >= N
    sbb y, 0      ; subtract 1 from y if carry flag set
    and x, y      ; set x to zero if x == N
The cmov looks better than sbb, but both have data dependencies than a predicted branch wouldn't.
Ahh! Serves me right for reading books from before the 486 :P
The 286/386 didn't have a cache so that wasn't a worry at that time.
>But I restate it's odd that you weren't even aware of the cost in 1992 as a system level programmer.

Costs were very different in the pipelines (or lack thereof) of eighties/early nineties hardware.

Have you tested this recently? I haven't for some years now but performance was identical regardless of direction. Maybe I need to try it again. I'd expect going backwards to be no worse than "not as good" - like say perhaps the prefetching mechanism doesn't cater for this case - but maybe my standards aren't high enough and this is enough to tip things over into the horrific.