|
Hi there, I'm Zac, Kagi's tech lead / author of the post-mortem etc. This has 100% been a learning experience for us, but I can provide some finer context re: observability. Kagi is a small team. The number of staff we have capable of responding to an event like this is essentially 3 people, seated across 3 timezones. For myself and my right-hand dev, this is actually our very first step in our web careers - this is to say that we are not some SV vets who have seen it all already. To say that we have a lot to learn is a given, from building Kagi from nothing though, I am proud of how far we've come & where we're going. Observability is something we started taking more seriously in the past 6 months or so. We have tons of dashboards now, and alerts that go right to our company chat channels and ping relevant people. And as the primary owner of our DB, GCP's query insights are a godsend. During the incident both our monitoring went off, as well as query insights showing the "culprit" query - but, we could have monitoring in the world, and still lack the experience to interpret it and understand what the root cause is or most efficient action to mitigate is. In other words, we don't have the wisdom yet to not be "gaslit" by our own systems if we're not careful. Only in hindsight can I say that GCP's query insights was 100% on the money, and not some bug in application space. All said, our growth has enabled us to expand our team quite a bit now. We have had SRE consultations before, and intend to bring on more full or part-time support to help keep things moving forward. |
Don't worry too much about all the people being harsh in the comments here. There's always a tendency for HN users to pile on with criticism whenever anyone has an outage.
I've always found this bizarre, because I've worked at places with worse issues, and more holes in monitoring or whatever than a lot of the companies that get skewered here. Perhaps many of us are just insecure about our own infra and project our feeling onto other companies when they have outages.
Y'all are doing fine, and I think it's to your credit that you're able to run Kagi's users table off a single, fairly cheap primary database instance. I've worked at places that haven't much thought to optimization, and "solve" scaling problems by throwing more and bigger hardware at it, and then wonder later on why they're bleeding cash on infrastructure. Of course, by that point, those inefficiencies are much more difficult to fix.
As for monitoring, unfortunately sometimes you don't know everything you need to monitor until something bad happens because your monitoring was missing something critical that you didn't realize was critical. That's fine; seems like y'all are aware and are plugging those holes. I'm sure there will be more of those holes in the future, that's just life.
At any rate, keep doing what you're doing, and I know the next time you get hit with something bad, things will be a bit better.