In the first two links you sent, the 40% result looks like the baseline case getting slower, not the unit under test getting faster. The core assembly looks identical in both cases.
The order in which the tests were run was the first thing I checked in his implementation, but I looked too quickly and thought he was generating the data separately for each variant, so I assumed that was not the problem. [Actually, you need the same data for both tests, but generated twice.]
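To make the "same data, generated twice" point concrete, here is a minimal sketch, not his actual harness. `make_data` and `time_ms` are hypothetical helper names of mine, and I am assuming the input is a vector of random ints. Seeding the generator identically gives each variant its own buffer with identical contents, so neither test inherits memory the other one has already touched.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: builds n pseudo-random ints from a fixed seed.
// Calling it twice with the same seed yields two separate buffers
// with identical contents.
std::vector<int> make_data(std::size_t n, std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> dist(0, 1000000);
    std::vector<int> v(n);
    for (int& x : v) x = dist(rng);
    return v;
}

// Hypothetical helper: wall-clock time of one call to f, in milliseconds.
template <class F>
double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

With something like this, I would time the baseline on `make_data(n, seed)` and the unit under test on a second `make_data(n, seed)`, then repeat with the order swapped. If the 40% follows the position in the run rather than the variant, it is an ordering effect.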
I was going to just point out that a 40% difference would mean that the version without the sentinel can be improved... I was going to check whether something is preventing branch prediction from absorbing that performance drop. Memory is only being read, so nothing should be invalidated...
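For reference, this is roughly what I have in mind when I say "with" and "without the sentinel". It is a sketch under my own assumptions (a linear search over ints, with one spare slot at the end of the buffer), and `find_plain` and `find_sentinel` are names I made up, not the ones from his code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Without the sentinel: every iteration has a bounds check and a match check.
// Returns n when key is not among the first n elements.
std::size_t find_plain(const std::vector<int>& buf, std::size_t n, int key) {
    for (std::size_t i = 0; i < n; ++i) {
        if (buf[i] == key) return i;
    }
    return n;
}

// With the sentinel: assumes buf has at least n + 1 elements, so buf[n]
// is a spare slot. Writing key there guarantees the loop terminates,
// which removes the bounds check from the loop body.
std::size_t find_sentinel(std::vector<int>& buf, std::size_t n, int key) {
    buf[n] = key;
    std::size_t i = 0;
    while (buf[i] != key) ++i;
    return i;  // i == n means key was only found in the sentinel slot
}
```

The bounds check in `find_plain` is taken on every iteration except the last, so I would expect it to be predicted almost perfectly, which is why a 40% gap from that check alone looks suspicious to me.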