Hacker News
by parth21shah 217 days ago
Right now I’m sticking to process lifecycle events (sched_process_fork and sched_process_exit), mostly for correlation. It’s much easier to grab container ID / cgroup metadata at fork time and say “this pod/image is the bad actor” than it is to reconstruct that context from a firehose of sched_switch events.

I agree that run queue latency / scheduler stats are the “better” signals for pure performance debugging. But scheduler switches generate a huge volume of events compared to forks, so I’m starting with fork/exec/exit + container/cgroup mapping.

If you’ve shipped scheduler-level tracing in production, I’d love to hear how you handled filtering + aggregation.
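
To make the fork-time correlation idea concrete, here is a rough userspace sketch in Python. It is not the actual tracer: it assumes something else (an eBPF program on sched_process_fork / sched_process_exit, say) delivers the child PID along with the text of /proc/<pid>/cgroup, and it assumes the container ID appears as a 64-character hex string in the cgroup path, which is common but not guaranteed. The names `container_id_from_cgroup` and `LifecycleTracker` are made up for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Assumption: container IDs show up as 64 hex chars somewhere in the cgroup path.
_CONTAINER_ID = re.compile(r"[0-9a-f]{64}")


def container_id_from_cgroup(cgroup_text: str) -> Optional[str]:
    """Return the first 64-hex container ID found in /proc/<pid>/cgroup text."""
    for line in cgroup_text.splitlines():
        # Each line has the form "hierarchy-ID:controllers:path".
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        match = _CONTAINER_ID.search(parts[2])
        if match:
            return match.group(0)
    return None


@dataclass
class LifecycleTracker:
    """Maps live PIDs to the container ID captured at fork time."""
    live: Dict[int, Optional[str]] = field(default_factory=dict)
    forks_by_container: Dict[Optional[str], int] = field(default_factory=dict)

    def on_fork(self, child_pid: int, cgroup_text: str) -> None:
        # Resolve the container once, at fork, while the context is cheap to get.
        cid = container_id_from_cgroup(cgroup_text)
        self.live[child_pid] = cid
        self.forks_by_container[cid] = self.forks_by_container.get(cid, 0) + 1

    def on_exit(self, pid: int) -> Optional[str]:
        # Drop the PID and report which container it belonged to (None if unknown).
        return self.live.pop(pid, None)

    def worst_offender(self) -> Optional[Tuple[Optional[str], int]]:
        # Container (or None for non-container processes) with the most forks.
        if not self.forks_by_container:
            return None
        return max(self.forks_by_container.items(), key=lambda kv: kv[1])
```

The point of the design is that the expensive lookup happens once per fork, so exit events (and anything higher-volume added later) only need a PID lookup in `live`.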