
by jorgeortiz85 3945 days ago

Hi, I work in infrastructure at Stripe and I'm happy to provide more insight. Several threads here have commented on our tooling and processes around index changes. I can give a bit more detail about how that works.

We have a library that allows us to describe expected schemas and expected indexes in application code. When application developers add or remove expected indexes in application code, an automated task turns these into alerts to database operators to run pre-defined tools that handle index operations.

In this situation, an application developer didn't add a new index description or remove an index description, but rather modified an existing index description. Our automated tooling erroneously handled this particular change and interpreted it not as a single intention but instead encoded it as two separate operations (an addition and a removal).
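
To illustrate the failure mode described here, the following is a minimal sketch and not Stripe's actual tooling. `IndexSpec`, `naive_diff`, and `linked_diff` are hypothetical names, and the sketch assumes each index description has a stable name that can be used to pair the old and new versions. A plain set difference reports a modified description as an unrelated add and remove, while pairing by identity reports one linked operation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexSpec:
    # Hypothetical description of an expected index in application code.
    collection: str
    name: str
    keys: tuple  # e.g. (("user_id", 1), ("created", -1))


def naive_diff(old, new):
    """Set difference on whole specs: a modified index becomes an add plus a remove."""
    old_set, new_set = set(old), set(new)
    ops = [("add", spec) for spec in new_set - old_set]
    ops += [("remove", spec) for spec in old_set - new_set]
    return ops


def linked_diff(old, new):
    """Pair specs by (collection, name) so a changed definition is one 'modify' operation."""
    old_by_id = {(s.collection, s.name): s for s in old}
    new_by_id = {(s.collection, s.name): s for s in new}
    ops = []
    for key in sorted(old_by_id.keys() | new_by_id.keys()):
        before, after = old_by_id.get(key), new_by_id.get(key)
        if before is None:
            ops.append(("add", after))
        elif after is None:
            ops.append(("remove", before))
        elif before != after:
            ops.append(("modify", before, after))
    return ops
```

With one index whose keys change, `naive_diff` yields two operations and `linked_diff` yields a single `("modify", before, after)` tuple, which keeps the developer's intent visible to whoever reviews the alert.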

Developers describe indices directly in the relevant application/model code to ensure we always have the right indices available -- and in part to help avoid situations like this. In addition, the tooling for adding and removing indexes in production is restricted to a smaller set of people, both for security and to provide an additional layer of review (also to help prevent situations like this). Unfortunately, because of the bug above, the intent was not accurately communicated. The operator saw two operations, not obviously linked to each other, among several other alerts, and, well, the result followed.

There are some pretty obvious areas for tooling and process improvements here. We've been investigating them over the last few days. For non-urgent remediations, we have a custom of waiting at least a week after an incident before conducting a full postmortem and determining remediations. This gives us time to cool down after an incident and think clearly about our remediations for the long-term. We'll be having these in-depth discussions, and making decisions about the future of our tooling and processes, over the next week.

5 comments

asuffield 3945 days ago

(Tedious disclaimer: my opinion, not speaking for my employer, etc)

I'm an SRE at Google, where postmortems are habitual. The thing that jumped out at me here is that a production change was instantaneously pushed globally, instead of being canaried on a fraction of the serving capacity so that problems could be detected. That seems like your big problem here.

(Of course, without knowing how your data storage works, it's difficult to tell how hard it is to fix that.)

jorgeortiz85 3945 days ago

Yup.

This is one of our few remaining unsharded databases (legacy problems...), so we can't easily canary a fraction of serving capacity. However, one clear remediation we can implement easily is to have our tooling change a replica first, fail over to it as primary, and, if problems are detected, quickly fail back to the healthy former primary.
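
The replica-first idea can be sketched roughly as follows. This is a hedged illustration only: `apply_change`, `promote`, and `is_healthy` are hypothetical callables standing in for whatever the real tooling would do, and the sketch ignores how writes accepted during the failover window are reconciled.

```python
def apply_with_failback(apply_change, promote, is_healthy, primary, replica):
    """Apply a change to the replica first, promote it, and fail back on trouble.

    All three callables are hypothetical hooks supplied by the caller.
    Returns the node that is serving as primary afterwards.
    """
    apply_change(replica)   # the current primary is left untouched
    promote(replica)        # the changed replica starts serving as primary
    if is_healthy(replica):
        return replica
    promote(primary)        # the former primary never received the change
    return primary
```

The point of the ordering is that the unchanged former primary stays available as a known-good target until the change has been observed under real traffic.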

Lesson learned. We'll be doing a review of all of our database tooling to make sure changes are always canaried or easily reversible.

eldavido 3945 days ago

hi jorge

I'd actually applied to work at Stripe about two years ago, you guys turned me down ;)

I was responsible for ops at a billion-device-scale mobile analytics company for about 1.5 years. Your tooling is far superior to anything we produced. I like the idea of a single source of truth describing the data model (code, tables, query patterns, etc.) a lot, and doubly so that it's revision-controlled and available right alongside the code.

I think it's far from decided though, how much to involve human operators in processes like this. Judging from this answer, you seem to be on the extreme end of "automate everything". How then, I'm curious, do you train/communicate to developers what can be done safely vs. something that would cause I/O bottlenecks, slowdown, or other potentially production-impacting effects? Can you even predict these things accurately in advance? (Some of our worst outages were caused by emergent phenomena that only manifested at production scale, such as hitting packet throughput and network bandwidth limits on memcached -- totally unforeseeable in a code-only test environment).

It sounds like you let developers request changes (a la "The Phoenix Project") but ops is responsible for final approval of the change? That actually sounds like a great system. Would love some elaboration on this.

In any case, great writeup and from one guy who's been there when the pager goes off to another, sounds like the recovery went pretty smoothly.

jorgeortiz85 3945 days ago

This is indeed a tricky balance. We want developers to iterate quickly, but we also want to understand the impact of production changes. With a small team and small sets of data, it's easy for everyone to understand the impact of changes and it's easy for modern hardware to hide inefficiencies. As we grow, the balance changes. It's harder for any one person to understand everything. It's also harder to hide inefficiencies with larger data sets.

We're always learning and improving. In order to scale, we'll need better ways to manage complexity and isolate failure. Our tools, patterns, and processes have changed quite a bit over the last few years, and they will continue to change. Ultimately, we want every Stripe employee to have the right information evident to them when they make decisions. This will be challenging, especially as we grow, but I'm excited to take on that challenge.

If you're still interested in working at Stripe, I'd encourage you to reapply! Our needs have changed quite a bit since you applied, and we're willing to reconsider candidates after a year has passed. Feel free to shoot me a resume: jorge@stripe.com

toomuchtodo 3945 days ago

Shouldn't developers understand how a database change is going to impact an environment based on the code they've written?

BinaryIdiot 3945 days ago

Yes, they very much should! But in my admittedly anecdotal experience, only the best / most senior ever do. Almost every junior or mid-level developer I've worked with (and a small handful of senior folks) not only has no idea how changes like this would impact the larger environment, but many won't even care to look into it.

noir_lord 3945 days ago

In part, though, that's because the tooling to do it easily absolutely sucks. The impedance mismatch (overused, but in context here) between the two parts of the system causes a lot of the underlying issues. Better tooling is a large part of the solution, I think, but I've not seen anything that would help, and the surface area of a modern RDBMS is so large, without even getting into vendor-specific stuff, that I'm not sure what that would even look like.

BinaryIdiot 3945 days ago

That's certainly a great point! If there were a way to automatically test much of this I bet even the newest of engineers could stop this. Doing that is tough, hmm...

noir_lord 3944 days ago

I think the only way you could do it on top of an RDBMS is to use a strict subset of features that are common (something that many ORMs already do), which reduces the problem scope down to something manageable. The issue then would be that there would always be the temptation to use something outside that subset and forgo the easier testing. Fast forward and you have the same issue.

It would be interesting to build an RDBMS that enforced that subset by simply not allowing those features to be used/abused, while still supporting many of the modern features (JSONB etc.), but that is way beyond my area of expertise.

dexterdog 3945 days ago

You would think so, but far too many developers don't really know how databases work under load.

devit 3945 days ago

Why not just use simple version-controlled database migrations, and test them in a test environment?

tempestn 3945 days ago

Generally you want your database migrations described in a straightforward manner for development; the migrations will contain a straightforward change from old to new (and back). With a live (busy) production database, it is often necessary to handle things differently to maintain up-time.

As a simple example, to make an atomic change to a write-only table, you could create a copy of the table, alter the copy as necessary, then in a single rename operation, rename the live table to '_old' and the '_new' table to live. You most likely would not want to add two additional table schemas and all of those steps to your development database operations.

It's entirely possible that they could capture what is done in production as migrations, and test them first, but it would still likely be separate from what the application developers are working with.

devit 3945 days ago

Development databases normally have a small amount of data, so migrations should execute instantly or nearly so no matter how complex they are.

tempestn 3945 days ago

True, but I don't think it negates anything I wrote. You don't keep development migrations simple so they'll run quickly; you keep them simple so they're easy to create and understand. Writing migrations (whether automated or manual) for production is a separate task and even a separate skill from designing the database structure itself, so there's no reason why the two need to be (or should be) combined.

tempestn 3945 days ago

Meant to write 'read-only' in the example there. Those steps wouldn't work well for a table that's being written to, since it could change in the process. Anyway, it was just an example.
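
For reference, the copy-and-rename steps from that example might look like the sketch below. It assumes MySQL, where a single `RENAME TABLE` statement can swap several tables atomically, and a table that is not being written to during the copy. The helper `copy_and_swap_statements` is a hypothetical name, it only builds the statements rather than running them, and it assumes a trusted table name plus an alteration (such as adding an index) that leaves the column list unchanged so that `SELECT *` still lines up.

```python
def copy_and_swap_statements(table, alter_clause):
    """Build MySQL statements that alter a copy of `table` and swap it in.

    Hypothetical helper: `table` is assumed to be a trusted identifier and
    `alter_clause` is assumed not to change the column list.
    """
    new, old = f"{table}_new", f"{table}_old"
    return [
        f"CREATE TABLE {new} LIKE {table}",                   # empty copy, same schema
        f"ALTER TABLE {new} {alter_clause}",                  # change only the copy
        f"INSERT INTO {new} SELECT * FROM {table}",           # copy the rows across
        f"RENAME TABLE {table} TO {old}, {new} TO {table}",   # one atomic swap
    ]
```

For example, `copy_and_swap_statements("events", "ADD INDEX idx_created (created_at)")` returns four statements ending in `RENAME TABLE events TO events_old, events_new TO events`.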

raspasov 3945 days ago

What kind of database was the incident on?

lacksconfidence 3945 days ago

Have you considered integrating index statistics into these changes? To take an example from MySQL, there is the INDEX_STATISTICS table in information_schema that contains the current number of rows read from the index. Checking this twice with a one minute interval before applying the index drop could have shown that the index was under heavy usage, and might require human intervention.
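
A rough sketch of that two-sample check, under stated assumptions: `read_rows_read` is a hypothetical callable that returns the cumulative rows-read counter for one index (for instance by querying the statistics table mentioned above), and `threshold` is an arbitrary cutoff chosen by the operator. The sketch does not handle counter resets between samples.

```python
import time


def index_in_use(read_rows_read, interval_seconds=60.0, threshold=0, sleep=time.sleep):
    """Sample a cumulative rows-read counter twice and report whether it grew.

    `read_rows_read` is a hypothetical callable supplied by the caller;
    `sleep` is injectable so the check can be exercised without waiting.
    """
    first = read_rows_read()
    sleep(interval_seconds)
    second = read_rows_read()
    return (second - first) > threshold
```

A drop tool could refuse to proceed, or ask for explicit confirmation, whenever this returns `True` for the index it is about to remove.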

codahale 3945 days ago

MongoDB doesn't track this information, unfortunately.

vostro_mf 3945 days ago

It looks like the latest version does: https://jira.mongodb.org/browse/SERVER-2227

The problem with MongoDB is that teams think they can get away with just setting it and forgetting it. Real companies have DBAs who monitor it, understand it, and make a living through it. They're just trying to automate it using fancy UIs. That's what you get for trying to automate your DBAs.

codahale 3944 days ago

3.1.x is a development branch and not intended for production use. When they release 3.2, MongoDB will support it.

spudlyo 3945 days ago

That was my thought as well, but this change was done by an operator rather than a DBA, and DBAs tend to be a bit more curious about these kinds of changes.