by rwiggins 882 days ago
Aaaahhh, it's crazy how much this incident resonates with me!

I've personally handled this exact same kind of outage more times than I'd care to admit. And just like the fine folks at Kagi, I've fallen into the same rabbit hole (database connection pool health) and tried all the same mitigations: futilely throwing new instances at the problem, believing that if I could just "reset" traffic it'd all be fixed, etc...

It doesn't help that the usual saturation metrics (CPU%, IOPS, ...) for databases typically don't move very much during outages like these. You see high query latency, sure, but you go looking and think: "well, it still has CPU and IOPS headroom..." without realizing that, as always, lock contention lurks.

In my experience, 98% of the time, any weirdness with DB connection pools is a result of weirdness in the DB itself. Not sure what RDBMS Kagi's running, but I'd highly recommend graphing global I/O wait time (seconds per second) and global lock acquisition time (seconds per second) for the DB. And also query execution time (seconds per second) per (normalized) query. Add a CPU utilization chart and you've got a dashboard that will let you quickly identify most at-scale perf issues.
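
To make "seconds per second" concrete, here's a rough sketch, assuming the DB exposes cumulative wait-time counters that you sample periodically. The `Sample` type and its field names are made up for illustration; the real counter names depend on the RDBMS.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading of cumulative DB counters (field names are hypothetical)."""
    timestamp: float    # wall-clock time of the reading, in seconds
    io_wait_s: float    # total I/O wait accumulated so far, in seconds
    lock_wait_s: float  # total lock acquisition wait so far, in seconds


def seconds_per_second(prev: Sample, curr: Sample) -> dict:
    """Turn two cumulative readings into wait rates (seconds waited per second)."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return {
        "io_wait": (curr.io_wait_s - prev.io_wait_s) / elapsed,
        "lock_wait": (curr.lock_wait_s - prev.lock_wait_s) / elapsed,
    }


# Two readings taken 10 seconds apart.
rates = seconds_per_second(
    Sample(timestamp=0.0, io_wait_s=0.0, lock_wait_s=0.0),
    Sample(timestamp=10.0, io_wait_s=5.0, lock_wait_s=40.0),
)
```

A handy way to read the result: a lock wait rate of 4.0 seconds per second means that, on average, four sessions were stuck waiting on locks at any given moment during the interval, even if CPU and IOPS look idle.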

Separately: I'm a bit surprised that search queries trigger RDBMS writes. I would've figured the RDBMS would only be used for things like user settings, login management, etc. I wonder if Kagi's doing usage accounting (e.g. incrementing a counter) in the RDBMS. That'd be an absolute classic failure mode at scale.


I was wondering the same thing.

They would have some writes indirectly due to searches, say if someone chooses to block a search result. They’re also going to have some history and analytics surely.

But yeah, it’s not obvious what should cause per-search write lock contention…

You know, in retrospect, I think Kagi expects O(thousands) searches per month per user, so doing per-user usage accounting in the DB is fine -- thanks to row-level locking.

Well, at least until you get a user who does 60k "in a short time period"... :-)
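
A toy model of why that one user hurts, under loud assumptions: every write arrives at the same instant, each write holds its row lock for a fixed time, and writers to the same row queue up one after another. This is not Kagi's actual schema or workload, and `total_lock_wait_ms` is a made-up helper.

```python
def total_lock_wait_ms(writes_by_user: dict, hold_time_ms: int) -> int:
    """Total time writers spend queued behind row locks, in milliseconds.

    Assumes one counter row per user. Writers to different rows never block
    each other; writers to the same row run strictly one at a time.
    """
    total = 0
    for n in writes_by_user.values():
        # The k-th writer on a row waits for the k-1 writers ahead of it,
        # so the row's total wait is (0 + 1 + ... + (n-1)) * hold_time_ms.
        total += hold_time_ms * n * (n - 1) // 2
    return total


# 1000 writes spread over 1000 users: every write hits its own row.
spread = total_lock_wait_ms({f"user{i}": 1 for i in range(1000)}, hold_time_ms=1)

# The same 1000 writes all coming from a single user: one hot row.
hot = total_lock_wait_ms({"heavy_user": 1000}, hold_time_ms=1)
```

Same write volume in both cases, but the spread-out case queues for nothing while the single hot row racks up 499,500 ms of cumulative waiting. The wait grows with the square of the writes per row, which is why row-level locking is fine right up until it isn't.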

It's the outliers and "surely nobody could be THAT awful" that kill you. Every time.

I once had a stock alert product running on a backend I wrote. One person signed up for alerts for every single Nasdaq ticker there was. We didn’t expect that.