by maknee 472 days ago
"All of this is made possible with the inclusion of frame pointers in all of Meta’s user space binaries, otherwise we couldn’t walk the stack to get all these addresses (or we’d have to do some other complicated/expensive thing which wouldn’t be as efficient)"

This makes things so, so, so much easier. Otherwise, a lot of effort has to be put into building an unwinder in eBPF code, essentially porting the .eh_frame CFA/RA/BP calculations.
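To illustrate why frame pointers make this easy, here is a minimal sketch of a frame-pointer walk over a simulated stack. It assumes the common x86-64 layout, where the saved caller frame pointer sits at [fp] and the return address at [fp + 8]. The memory dict, the addresses, and the function name are all made up for illustration; a real profiler would do these reads from eBPF against the target's memory.

```python
WORD = 8  # bytes per 64-bit word

def walk_frame_pointers(memory, fp, max_depth=64):
    """Follow the saved-frame-pointer chain and collect return addresses.

    memory is a hypothetical dict mapping address -> 64-bit word.
    """
    return_addresses = []
    for _ in range(max_depth):
        # Stop at the end of the chain or at an unreadable address.
        if fp == 0 or fp not in memory or fp + WORD not in memory:
            break
        return_addresses.append(memory[fp + WORD])  # return address
        next_fp = memory[fp]                        # caller's frame pointer
        # The stack grows down, so caller frames live at higher addresses.
        # Anything else means a corrupt chain or a loop.
        if next_fp != 0 and next_fp <= fp:
            break
        fp = next_fp
    return return_addresses

# Three made-up frames: each entry is (saved fp, return address).
memory = {
    0x7000: 0x7040, 0x7008: 0x401234,
    0x7040: 0x7080, 0x7048: 0x401500,
    0x7080: 0x0,    0x7088: 0x401900,
}
stack = walk_frame_pointers(memory, 0x7000)
print([hex(addr) for addr in stack])
```

Without frame pointers, each step would instead need a per-address lookup in the unwind tables to work out where the caller's frame and return address are, which is the part that is painful to port to eBPF.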

They claim to have event profilers for non-native languages (e.g. Python). Does this mean that they use something similar to https://github.com/benfred/py-spy ? Otherwise, it's not obvious to me how they can read Python state.

Lastly, the GitHub repo https://github.com/facebookincubator/strobelight is pretty barebones. I wonder when they'll update it.

Already been done:

1) native unwinding: https://www.polarsignals.com/blog/posts/2022/11/29/dwarf-bas...

2) python: https://www.polarsignals.com/blog/posts/2023/10/04/profiling...

Both available as part of the Parca open source project.

https://www.parca.dev/

(Disclaimer: I work on Parca and am the founder of Polar Signals.)

Thanks! Those blogs are incredibly useful. Nice work on the profiler. :)

I have multiple questions if you don’t mind answering them:

Is there significant overhead to native and Python unwinding in eBPF? eBPF needs to constantly read and copy from user space to read data structures.

I ask this because unwinding with frame pointers can be done by reading without copying in userland.

Python can be run with different engines (CPython, PyPy, etc.) and versions (3.7, 3.8, …), and compilers can reorganize offsets. Reading from fixed offsets seems handwavy to me. Does this work well in practice? When did it fail?

Thank you!

Overhead ultimately depends on the sampling frequency. It defaults to 19 Hz per core, at which it's less than 1%; that has been tried and tested with all sorts of super heavy Python, JVM, Rust, etc. workloads. Since it's per core, there tend to be plenty of stacks to build statistical significance quickly. The profiler is essentially a thread-per-core model, which certainly helps for perf.
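As a rough back-of-the-envelope for the "plenty of stacks" point, the sketch below works out the sampling interval and an upper bound on the sample count at 19 Hz per core. The 64-core machine and 10-second window are hypothetical numbers picked for illustration, and the count assumes every core yields a sample on every tick.

```python
def sampling_interval_ms(hz_per_core):
    """Time between two samples on one core, in milliseconds."""
    return 1000.0 / hz_per_core

def expected_samples(hz_per_core, cores, seconds):
    """Upper bound on stacks collected if every core is sampled on every tick."""
    return hz_per_core * cores * seconds

HZ = 19        # default frequency mentioned above
CORES = 64     # hypothetical machine size
WINDOW_S = 10  # hypothetical collection window, in seconds

interval = sampling_interval_ms(HZ)
samples = expected_samples(HZ, CORES, WINDOW_S)
print(f"one sample every {interval:.2f} ms per core")
print(f"up to {samples} stacks in {WINDOW_S} s across {CORES} cores")
```

So even at a low per-core frequency, a machine of that size can yield on the order of ten thousand stacks in ten seconds.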

The offset approach has evolved a bit; today it's mixed with some disassembling, and with that combination it's rock solid. It is dependent on the engine, and in the case of Python only CPython is supported today.
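For anyone wondering what an offset-based read looks like in principle, here is a minimal sketch, not Parca's actual implementation. It keys a table of struct offsets by (engine, version) and reads a pointer-sized field from a raw byte buffer. The offsets, the field name frame_code, and the helper names are placeholders invented for illustration, not real CPython layouts.

```python
import struct

# Placeholder offsets for illustration only; NOT real CPython struct layouts.
OFFSETS = {
    ("cpython", (3, 7)): {"frame_code": 32},
    ("cpython", (3, 8)): {"frame_code": 40},
}

def read_pointer(buf, offset):
    """Read a little-endian 64-bit value from buf at the given byte offset."""
    return struct.unpack_from("<Q", buf, offset)[0]

def read_field(buf, engine, version, field):
    """Look up the offset for this engine/version and read the field."""
    table = OFFSETS.get((engine, version))
    if table is None:
        # Unknown engine or version: refuse instead of guessing an offset.
        raise LookupError(f"no offsets for {engine} {version}")
    return read_pointer(buf, table[field])

# Fake "process memory" laid out the way the hypothetical 3.8 table expects.
buf = bytearray(64)
struct.pack_into("<Q", buf, 40, 0xDEADBEEF)

right = read_field(buf, "cpython", (3, 8), "frame_code")
wrong = read_field(buf, "cpython", (3, 7), "frame_code")
print(hex(right), hex(wrong))
```

Reading with the wrong version's table silently returns unrelated bytes, which is why the interpreter and version have to be identified first and why unsupported engines are rejected rather than guessed at.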

Short note: Also available as the standard Otel profiling agent ;)