Then you have to deal with branch mispredictions, which may hurt performance pretty badly if the RB is heavily trafficked (which is often the use case for an RB).
Actually, it might not really be mispredicted. Slightly older Intel CPUs had a dedicated loop predictor that would exactly predict quite complicated taken/not-taken sequences. If the RB is of fixed size, the edge case would always be perfectly predicted.
More recent CPUs, IIRC, do away with the dedicated loop predictor because they have a much more sophisticated general predictor, which, although it won't guarantee perfect prediction in this case, might still get close enough.
It depends heavily on the size of the buffer. If it's only three elements large, the branch overhead will be measurable. But for larger buffers, the wraparound check will likely always be predicted as not taken, and you only get a branch miss on wraparound.
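For anyone who wants to see the branch being discussed, here is a minimal sketch in C. The names `RB_SIZE`, `advance_branch`, and `advance_mask` are made up for illustration and do not come from any particular RB implementation. The first function has the wraparound check, which is false on all but one of every `RB_SIZE` calls; the second avoids the conditional entirely, under the assumption that the size is a power of two.

```c
#include <stddef.h>

/* Hypothetical capacity; must be a power of two for the mask variant. */
#define RB_SIZE 8u

/* Branching advance: the condition is true only once per RB_SIZE calls,
 * so for large buffers it is "not taken" almost every time. */
static inline size_t advance_branch(size_t i)
{
    i++;
    if (i == RB_SIZE)
        i = 0;
    return i;
}

/* Mask advance: no conditional at all, but only correct when RB_SIZE
 * is a power of two. */
static inline size_t advance_mask(size_t i)
{
    return (i + 1) & (RB_SIZE - 1);
}
```

Both functions return the same index for any `i` in `[0, RB_SIZE)`. Note that a compiler may turn the `if` into a conditional move anyway, in which case there is no branch left to mispredict, so it is worth checking the generated assembly before drawing conclusions.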