| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by PhilipRoman 482 days ago
	IME CPU caches can be observed in almost all languages, regardless of how detached they are from the machine. What I'm really wondering is, if branch prediction can be observed from interpreted code.

5 comments

exyi 482 days ago

It is! Although my test case is probably an unrealistically bad scenario:

It's the classic, why is processing sorted array faster than unsorted one

    def f(arr):
        r = True
        for x in arr:
           if x > 128: r = not r
        return r

    arr = [ random.randint(0, 255) for i in range(0, 1000_000) ]
    arr_sorted = list(sorted(arr))

    %timeit f(arr)
    %timeit f(arr_sorted)

Results are (on my machine): 17.5 ms for unsorted, and 13.5 ms for sorted. For comparison, in a compiled language, the difference is 4x

link

ammar2 482 days ago

Edit: Analyzed the wrong thing earlier.

This depends on the Python version, but if it has the specializing interpreter changes, the `COMPARE_OP` comparing the integers there is probably hitting a specialized `_COMPARE_OP_INT` [1].

This specialization has a ternary that does `res = (sign_ish & oparg) ? PyStackRef_True : PyStackRef_False;`. This might be the branch that ends up getting predicted correctly?

Older versions of Python go through a bunch of dynamic dispatch first and then end up with a similar sort of int comparison in `long_richcompare`. [2]

[1] https://github.com/python/cpython/blob/561965fa5c8314dee5b86...

[2] https://github.com/python/cpython/blob/561965fa5c8314dee5b86...

link

dzaima 482 days ago

This isn't actually timing the sorting, but just the (dumb) function f.

link

ammar2 482 days ago

Oh whoops, that's right. I totally missed that.

link

LPisGood 482 days ago

This is a really good example. It is more like branch prediction than standard data/instruction caching.

I wonder if you could do Spectre type vulnerabilities in python. You would need some way to leak micro-architectural state, so without being particularly clever maybe python code could only be used as a gadget or something.

link

cma 482 days ago

Python speed up is probably from small integer caching, a sorted array will have runs of pointers to the same integers adjacent. The compiled language one is probably branch prediction right?

link

exyi 482 days ago

I intentionally stayed in the small integer range to avoid benchmarking the cache. 256 distinct values should fit into L1 just fine in both cases.

I'm now thinking that the difference might be even larger if we instead avoid small integers and let the CPU get stuck chasing pointers. The idea is that it gets stuck on a memory access, which forces it to speculate much further, which in turn makes it backtrack a longer path if a branch was mispredicted. I'm obviously no expert on this, feel free to correct me

The results for 1B range instead of 255 are 17.6 ms for unsorted / 68.2 ms for sorted! We are back to what the original article observed and it's a way stronger effect than what branch prediction can offer. So don't sort your arrays, keep them in the order the boxed values were allocated ;)

link

cma 482 days ago

How big is the pointed to small integer? With alignment etc. I'm seeing some stuff saying 256 of them would fill an 8KB L1. Plus other stuff for the interpreter might overfill it. Sorted that would be less of an issue.

Larger range one being slower unsorted yes makes sense because of allocation order no longer matching the iteration order.

link

exyi 482 days ago

I don't know how large are those boxes, but normal CPU L1 cache has 32 or 48KB which should be plenty for this. Python opcodes for this program are going to be tiny, and the interpreter itself uses the instruction-L1 cache (which is another 32-48KB). I hope the sequential scan of the big array won't flush the L1 cache (there should be 12-way associativity with LRU, so I don't see how it could).

Anyway, there is no need to have 256 integers, just 2 is enough. When I try that, the results are similar: 17.5 ms (unsorted) / 12.5 ms (sorted)

link

bgwalter 482 days ago

That seems very likely. The benchmark should probably use a range that is guaranteed to be outside of the cached smallint range.

link

exyi 482 days ago

Then you are back to what the article discusses. Each integer is in a separate box, those boxes are allocated in one order, sorting the array by value will shuffle it by address and it will be much slower. I tested this as well, see the other comment.

link

eulgro 482 days ago

My guess would be that branch misprediction does have an impact on interpreted language, but much less. If bytecode instructions take on average 20 CPU cycles to execute, and the branch misprediction penalty is 50 CPU cycles, the relative cost of a misprediction is much smaller than in compiled code.

link

adgjlsfhk1 482 days ago

the counterpoint is that a lot of those 20 instructions are also branches

link

vlovich123 482 days ago

Why would you think that branch prediction wouldn’t be? Do you think the interpreter is adding too much false sharing to the branch predictor to render it worthless?

link

almostgotcaught 482 days ago

you do realize that the Python

  if x > 128: 
     r = not r

doesn't necessarily correspond to one branch at the ASM instruction level right? in fact, a priori, it doesn't even correspond to absolutely any branches at the ASM instruction level.

link

vlovich123 482 days ago

Yes I do and that’s neither here nor there. There almost certainly ASM instruction level branches to implement the conditional and the branch predictor isn’t tied to a single instruction - the rough high level mental model is a cache of a few of the least significant digits of the CPU to the prediction although in practice it’s far more complicated. Since the predictor is right like 80% of the time, it means that even when there’s false sharing of a lot of branches, the CPU does a good job predicting. It’s performance is primarily impacted when execution takes both branches closer to 50/50 than to 100/0.

That’s why I asked about false sharing specifically as that’s the main way I can think of that Python code wouldn’t be able to observe the branch predictor performance - because the interpreter has so many branches internally that it dominates any branches that might be caused by the Python code itself.

link

exyi 482 days ago

It corresponds to a way more than one branch at instruction level. The branch prediction AFAIK does not care based on what are you branching, it just assumes branches will go in similar sequences as they did last time. If the Python 'if' is never taken, the instruction-level predictor will learn that after the comparison operation, there is an 'if' operation and then another array access operation. If the Python 'if' is unpredictable unpredictable, CPU predictor can't be sure which opcode are we processing after 'if', so there will be penalty.

link

mardifoufs 482 days ago

Is there any public documentation on modern branch prediction algorithms? I know branch prediction is very important to modern CPU so SOTA techniques are probably not public... But it's really amazing what it can do especially considering the "limited" cache sizes that branch predictors have .

link

exyi 482 days ago

I guess it depends on how deep you want to go, I think the real predictors are based on publicly known algorithms such as TAGE. This seems to be nice overview: https://comparch.net/2013/06/30/why-tage-is-the-best/ (it's 2013, so definitely not SOTA, but advanced enough for my taste :) )

link

stingraycharles 482 days ago

It should be, as most interpreted code uses a JIT native compiler at this point, which implies the branch prediction is native as well.

I would be happy to stand corrected if someone knows more about this, though!

link

pansa2 482 days ago

> most interpreted code uses a JIT native compiler at this point

I suspect most JavaScript code is JIT-compiled - but most Python, Ruby etc is interpreted (via bytecode).

link

tarruda 482 days ago

AFAIK CPython doesn't JIT compile.

link

pansa2 482 days ago

Recent versions of CPython do have an experimental JIT compiler. I'm not sure how widely-used it is, though.

https://peps.python.org/pep-0744/

link

Qem 482 days ago

It's disabled by default, not production ready. You need to compile CPython yourself if you want to try it.

link

saagarjha 482 days ago

Yep, this is why Spectre mitigations are applied to browsers.

link