| Lots to unpack, here. How old is the company? There's certainly a stage where features are critical, but there's also one where you have 80% coverage, and it's time to harden them and clarify the product and feature set? What does customer churn look like? That'll tell you how bad the problem really is. The alerting situation sounds bad. Sometimes it can be that they're mostly false alarms, in which case people will ignore them. The fix there is spending a few days tuning the thresholds and cleaning up unmeaningful alerts. This is also an opportunity to drive this. If people are ignoring real alerts, that's bad, and it's harder to fix. What would happen in an actual outage? You're at a startup. Have you talked with anyone on the sales side about what they're hearing? Is it hurting sales or retention? Look for people with titles like account manager, account executive, solutions engineer, etc. How small is the startup? If it's less than 150 people, the signs actually point to it being a somewhat real issue, and you don't desperately need the job, you can book a meeting with someone high up. CEOs at startups have open door policies for things like this that slip through the cracks. Just do your homework beforehand, only do this rarely, and be ready for it to possibly backfire (but if it does, you don't want to work there, anyway). Keep in mind that all codebases are shit, especially the one you're currently working on. If things are bad enough that it hurts sales or retention, and there's no push from sales or drive in engineering to fix things. I'd leave. Remember that part of your compensation is equity, and the company's success depends on the success of the customers. Another way of framing issues only happening 1% of the time is you only get beat up with a baseball bat on your way to the office twice per year. One place I worked had a known issue we were very unhappy with where the p95 latency was .5s, p99 was 1s, and p999 was 10s. People feel p99+ more, and it's usually better to track these metrics on the high-end. And the obligatory "it's a feature, not a bug." |