by danishSuri1994 217 days ago
This is a great example of the blind spot between sampling-based observability and event-driven tracing.

Anything that appears + disappears between polls is effectively invisible unless you’re streaming syscalls/process events. It’s surprising how often “short-lived, high-impact” processes cause the worst production spikes.
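
To make the blind spot concrete, here is a minimal sketch of the arithmetic, assuming a poller that samples at exact multiples of a fixed interval. The function name and the millisecond units are illustrative assumptions, not taken from any particular tool.

```python
def visible_to_poller(start_ms, end_ms, poll_interval_ms):
    """Return True if some poll tick lands while the process is alive.

    Assumes polls fire at 0, poll_interval_ms, 2 * poll_interval_ms, ...
    and that the process is alive during the half-open window [start_ms, end_ms).
    """
    # First poll tick at or after the process start (integer ceiling division).
    first_poll = -(-start_ms // poll_interval_ms) * poll_interval_ms
    return first_poll < end_ms


# A 200 ms process that straddles the 1000 ms tick gets sampled.
print(visible_to_poller(900, 1100, 1000))
# The same 200 ms process started just after the tick is never sampled.
print(visible_to_poller(1050, 1250, 1000))
```

Any process whose lifetime is shorter than the poll interval can fall entirely between two ticks, no matter how much CPU it burns while it runs.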

Curious whether you’re planning to surface this at the scheduler level (run queue latency/involuntary context switches) or stick to process-lifecycle tracing?

1 comment

Right now I’m sticking to process lifecycle (sched_process_fork and sched_process_exit), mostly for correlation. It’s much easier to grab container ID / cgroup metadata at fork time and say “this pod/image is the bad actor” than it is to reconstruct that context off a firehose of sched_switch events.

I agree that run queue latency / scheduler stats are the “better” signals for pure performance debugging. But scheduler switches generate a huge volume of events compared to forks, so I’m starting with fork/exec/exit + container/cgroup mapping.

If you’ve shipped scheduler-level tracing in production, I’d love to hear how you handled filtering + aggregation.
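
For anyone curious what the fork/exit correlation could look like on the userspace side, here is a rough sketch, not the actual implementation. It assumes events have already been pulled out of the tracepoints into `(kind, pid, ts_ns, cgroup_path)` tuples, which is a hypothetical shape, and that the container ID shows up as a 64-character hex string somewhere in the cgroup path, which is common for Docker and containerd but not guaranteed.

```python
import re
from collections import defaultdict

# Assumption: container IDs appear as 64 lowercase hex chars in the cgroup path.
_CONTAINER_ID = re.compile(r"[0-9a-f]{64}")


def container_id_from_cgroup(cgroup_path):
    """Return the first 64-hex-char run in the path, or None if there is none."""
    match = _CONTAINER_ID.search(cgroup_path)
    return match.group(0) if match else None


def short_lived_by_container(events, max_lifetime_ns):
    """Count processes per container that exited within max_lifetime_ns of fork.

    events: iterable of (kind, pid, ts_ns, cgroup_path), kind is "fork" or "exit".
    The cgroup path is only read on fork events, where the context is captured.
    """
    live = {}  # pid -> (fork timestamp, container id captured at fork time)
    counts = defaultdict(int)
    for kind, pid, ts_ns, cgroup_path in events:
        if kind == "fork":
            live[pid] = (ts_ns, container_id_from_cgroup(cgroup_path))
        elif kind == "exit" and pid in live:
            started_ns, container_id = live.pop(pid)
            if ts_ns - started_ns <= max_lifetime_ns:
                counts[container_id] += 1
    return dict(counts)


cid = "ab" * 32  # made-up container ID for the example
events = [
    ("fork", 10, 0, "/kubepods/pod1/" + cid),
    ("exit", 10, 5_000_000, ""),
    ("fork", 11, 0, "/system.slice/sshd.service"),
    ("exit", 11, 10_000_000_000, ""),
]
print(short_lived_by_container(events, max_lifetime_ns=50_000_000))
```

Capturing the container ID at fork time means the exit handler only needs a pid lookup, which is the main reason this is cheaper than rebuilding the same context from sched_switch events.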