(The code is from the stream vbyte repo; see https://lemire.me/blog/2017/09/27/stream-vbyte-breaking-new-... ).
Edit: Example interesting fact: the first iteration takes 35 cycles before it finishes; but the reciprocal throughput of the loop (assuming it executes a few times) is 5.3 cycles.