by eldavido 3903 days ago
Hi Jorge,

I actually applied to work at Stripe about two years ago; you guys turned me down ;)

I was responsible for ops at a billion-device-scale mobile analytics company for about 1.5 years. Your tooling is far superior to anything we produced. I really like the idea of a single source of truth describing the data model (code, tables, query patterns, etc.), and doubly so that it's revision-controlled and available right alongside the code.

I think it's far from settled, though, how much to involve human operators in processes like this. Judging from this answer, you seem to be on the extreme end of "automate everything". How then, I'm curious, do you train developers on (or communicate to them) what can be done safely vs. what would cause I/O bottlenecks, slowdowns, or other potentially production-impacting effects? Can you even predict these things accurately in advance? (Some of our worst outages were caused by emergent phenomena that only manifested at production scale, such as hitting packet throughput and network bandwidth limits on memcached -- totally unforeseeable in a code-only test environment.)

It sounds like you let developers request changes (a la "The Phoenix Project") but ops is responsible for final approval of the change? That actually sounds like a great system. Would love some elaboration on this.

In any case, great writeup, and from one guy who's been there when the pager goes off to another, it sounds like the recovery went pretty smoothly.

2 comments

This is indeed a tricky balance. We want developers to iterate quickly, but we also want to understand the impact of production changes. With a small team and small sets of data, it's easy for everyone to understand the impact of changes and it's easy for modern hardware to hide inefficiencies. As we grow, the balance changes. It's harder for any one person to understand everything. It's also harder to hide inefficiencies with larger data sets.

We're always learning and improving. In order to scale, we'll need better ways to manage complexity and isolate failure. Our tools, patterns, and processes have changed quite a bit over the last few years, and they will continue to change. Ultimately, we want every Stripe employee to have the right information evident to them when they make decisions. This will be challenging, especially as we grow, but I'm excited to take on that challenge.

If you're still interested in working at Stripe, I'd encourage you to reapply! Our needs have changed quite a bit since you applied, and we're willing to reconsider candidates after a year has passed. Feel free to shoot me a resume: jorge@stripe.com

Shouldn't developers understand how a database change is going to impact an environment based on the code they've written?
Yes, they very much should! But in my admittedly anecdotal experience, only the best / most senior ever do. Almost every junior or mid-level developer I've worked with (and a small handful of senior folks) not only has no idea how changes like this would impact the larger environment, but many won't even care to look into it.
In part, though, that's because the tooling to do it easily absolutely sucks. The impedance mismatch (overused, but in context here) between the two parts of the system causes a lot of the underlying issues. Better tooling is a large part of the solution, I think, but I've not seen anything that would help, and the surface area of a modern RDBMS is so large, even without getting into vendor-specific stuff, that I'm not sure what that would even look like.
That's certainly a great point! If there were a way to automatically test much of this, I bet even the newest of engineers could stop this. Doing that is tough, hmm...
I think the only way you could do it on top of an RDBMS is to use a strict subset of common features (something that many ORMs already do), which reduces the problem scope down to something manageable. The issue then would be that there would always be the temptation to use something outside that subset and forgo the easier testing; fast forward and you have the same issue.

It would be interesting to build an RDBMS that enforced that subset by simply not allowing those features to be used/abused, with support for many of the modern features (JSONB etc.), but that is way beyond my area of expertise.
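
For what it's worth, the "strict subset" idea could be approximated in tooling rather than in the database itself. Below is a rough sketch, not anything Stripe or any particular ORM actually does: a check that flags migration statements falling outside an allowlist. The patterns in `ALLOWED_PATTERNS` and the function name `disallowed_statements` are hypothetical, chosen only for illustration, and splitting on `;` is deliberately naive (it would break on semicolons inside string literals).

```python
import re

# Hypothetical allowlist: statement shapes a team has decided are safe to
# run without extra review. These patterns are illustrative assumptions.
ALLOWED_PATTERNS = [
    r"SELECT\b",
    r"INSERT INTO\b",
    r"CREATE TABLE\b",
    r"ALTER TABLE \S+ ADD COLUMN\b",
    r"CREATE INDEX CONCURRENTLY\b",
]


def normalize(statement: str) -> str:
    """Uppercase and collapse whitespace so patterns compare reliably."""
    return re.sub(r"\s+", " ", statement.strip()).upper()


def disallowed_statements(migration_sql: str) -> list[str]:
    """Return the statements that fall outside the allowlisted subset.

    Naive on purpose: statements are split on ';', which is wrong for
    semicolons inside string literals or function bodies.
    """
    statements = [s.strip() for s in migration_sql.split(";") if s.strip()]
    return [
        s
        for s in statements
        if not any(re.match(p, normalize(s)) for p in ALLOWED_PATTERNS)
    ]


if __name__ == "__main__":
    migration = """
        CREATE TABLE t (id int);
        ALTER TABLE t ADD COLUMN x int;
        ALTER TABLE t DROP COLUMN x;
        CREATE INDEX i ON t (x);
    """
    for stmt in disallowed_statements(migration):
        print("needs review:", stmt)
```

Something like this could run in CI, but it only catches statement shapes. It says nothing about how a change behaves under production load, and the temptation to step outside the subset is still there.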

You would think so, but far too many developers don't really know how databases work under load.