IME CPU caches can be observed in almost all languages, regardless of how detached they are from the machine. What I'm really wondering is, if branch prediction can be observed from interpreted code.
It is! Although my test case is probably an unrealistically bad scenario:
It's the classic, why is processing sorted array faster than unsorted one
def f(arr):
r = True
for x in arr:
if x > 128: r = not r
return r
arr = [ random.randint(0, 255) for i in range(0, 1000_000) ]
arr_sorted = list(sorted(arr))
%timeit f(arr)
%timeit f(arr_sorted)
Results are (on my machine): 17.5 ms for unsorted, and 13.5 ms for sorted. For comparison, in a compiled language, the difference is 4x
This depends on the Python version, but if it has the specializing interpreter changes, the `COMPARE_OP` comparing the integers there is probably hitting a specialized `_COMPARE_OP_INT` [1].
This specialization has a ternary that does
`res = (sign_ish & oparg) ? PyStackRef_True : PyStackRef_False;`.
This might be the branch that ends up getting predicted correctly?
Older versions of Python go through a bunch of dynamic dispatch first and then end up with a similar sort of int comparison in `long_richcompare`. [2]
This is a really good example. It is more like branch prediction than standard data/instruction caching.
I wonder if you could do Spectre type vulnerabilities in python. You would need some way to leak micro-architectural state, so without being particularly clever maybe python code could only be used as a gadget or something.
Python speed up is probably from small integer caching, a sorted array will have runs of pointers to the same integers adjacent. The compiled language one is probably branch prediction right?
I intentionally stayed in the small integer range to avoid benchmarking the cache. 256 distinct values should fit into L1 just fine in both cases.
I'm now thinking that the difference might be even larger if we instead avoid small integers and let the CPU get stuck chasing pointers. The idea is that it gets stuck on a memory access, which forces it to speculate much further, which in turn makes it backtrack a longer path if a branch was mispredicted. I'm obviously no expert on this, feel free to correct me
The results for 1B range instead of 255 are 17.6 ms for unsorted / 68.2 ms for sorted! We are back to what the original article observed and it's a way stronger effect than what branch prediction can offer. So don't sort your arrays, keep them in the order the boxed values were allocated ;)
How big is the pointed to small integer? With alignment etc. I'm seeing some stuff saying 256 of them would fill an 8KB L1. Plus other stuff for the interpreter might overfill it. Sorted that would be less of an issue.
Larger range one being slower unsorted yes makes sense because of allocation order no longer matching the iteration order.
I don't know how large are those boxes, but normal CPU L1 cache has 32 or 48KB which should be plenty for this. Python opcodes for this program are going to be tiny, and the interpreter itself uses the instruction-L1 cache (which is another 32-48KB). I hope the sequential scan of the big array won't flush the L1 cache (there should be 12-way associativity with LRU, so I don't see how it could).
Anyway, there is no need to have 256 integers, just 2 is enough. When I try that, the results are similar: 17.5 ms (unsorted) / 12.5 ms (sorted)
Then you are back to what the article discusses. Each integer is in a separate box, those boxes are allocated in one order, sorting the array by value will shuffle it by address and it will be much slower. I tested this as well, see the other comment.
My guess would be that branch misprediction does have an impact on interpreted language, but much less. If bytecode instructions take on average 20 CPU cycles to execute, and the branch misprediction penalty is 50 CPU cycles, the relative cost of a misprediction is much smaller than in compiled code.
Why would you think that branch prediction wouldn’t be? Do you think the interpreter is adding too much false sharing to the branch predictor to render it worthless?
doesn't necessarily correspond to one branch at the ASM instruction level right? in fact, a priori, it doesn't even correspond to absolutely any branches at the ASM instruction level.
Yes I do and that’s neither here nor there. There almost certainly ASM instruction level branches to implement the conditional and the branch predictor isn’t tied to a single instruction - the rough high level mental model is a cache of a few of the least significant digits of the CPU to the prediction although in practice it’s far more complicated. Since the predictor is right like 80% of the time, it means that even when there’s false sharing of a lot of branches, the CPU does a good job predicting. It’s performance is primarily impacted when execution takes both branches closer to 50/50 than to 100/0.
That’s why I asked about false sharing specifically as that’s the main way I can think of that Python code wouldn’t be able to observe the branch predictor performance - because the interpreter has so many branches internally that it dominates any branches that might be caused by the Python code itself.
It corresponds to a way more than one branch at instruction level. The branch prediction AFAIK does not care based on what are you branching, it just assumes branches will go in similar sequences as they did last time. If the Python 'if' is never taken, the instruction-level predictor will learn that after the comparison operation, there is an 'if' operation and then another array access operation. If the Python 'if' is unpredictable unpredictable, CPU predictor can't be sure which opcode are we processing after 'if', so there will be penalty.
Is there any public documentation on modern branch prediction algorithms? I know branch prediction is very important to modern CPU so SOTA techniques are probably not public... But it's really amazing what it can do especially considering the "limited" cache sizes that branch predictors have .
I guess it depends on how deep you want to go, I think the real predictors are based on publicly known algorithms such as TAGE. This seems to be nice overview: https://comparch.net/2013/06/30/why-tage-is-the-best/ (it's 2013, so definitely not SOTA, but advanced enough for my taste :) )
It's the classic, why is processing sorted array faster than unsorted one
Results are (on my machine): 17.5 ms for unsorted, and 13.5 ms for sorted. For comparison, in a compiled language, the difference is 4x