I'm doing research on how oncall breaks down as engineering teams scale, specifically at Series A/B stage companies (roughly 15-30 engineers).

The hypothesis I'm exploring: oncall workflows were designed for small teams where everyone knows the codebase. They break at scale because of specialization and context loss before volume becomes a problem.

Some patterns I've heard so far:

- Oncall engineers spending 10-20% of time just routing bugs to the right owner
- Bug reports from CS/sales teams missing critical context (no logs, vague repro steps)
- "Couple hours" average per bug, mostly on investigation not fixing
- Session replay tools too expensive to run at meaningful coverage

I'm trying to figure out:

- Is this universal, or just specific to certain tech stacks/org structures?
- What's the actual breaking point: team size, user scale, something else?
- Has anyone solved this well? What worked?

If you're currently running oncall at a growing company, I'd love to hear:

- What percentage of oncall time is triage/routing vs. actual fixing?
- What broke first as you scaled?
- What have you tried?