Hacker News
by Karrot_Kream 1092 days ago
It's interesting. I work for what is now a Big Tech company, but I started before the IPO.

We built a lot of in-house infra. We had few runbooks. We ran tons of high-scale services with a pretty low corresponding headcount. And we had a great culture that was definitely more personality-based than the rest of Big Tech. Individuals had Opinions (TM). On-call was about having a durable understanding of the system and sleuthing around to figure out what was happening (along with a strong emphasis on learning from failures and putting systems/practices into place to fix those). And we, as engineers, were weird. My tech lead rolled into work at 11 AM every morning on their skateboard. I used to take a 2-hour daily break from 2-4 PM, go home at 6, and work again from 8-10 PM (ish, this was flexible) before I went to bed. I was also known to enjoy coffee with fried chicken. Our infrastructure and our services reflected our personality quirks and the interpersonal relationships we had in the office. On-call was one of my favorite things at the company because all the different personalities would weigh in with their different quirks and would offer different perspectives on big problems, and together we'd solve tough issues.

Somewhere along the ZIRP-age this all changed. We needed to become a Real Company (TM) now. We needed to have runbooks. We needed to have Boring Technology. We were poised for huge growth after all (aka our stock value shot through the roof.) The new managers we brought in needed to build out new teams to grow fast, fast, fast. How could we grow when we had such immature processes? We started building runbooks and hiring teams to manage them. Incidents started requiring a 3-page minimum of paperwork to document, which was enforced by an incident management team. Teams dreaded these incidents, and where once we collaborated to fix problems, now teams became defensive and combative during incidents. Now we needed the incident management team to force engineers to cooperate during incidents because each team was trying to be as defensive as possible. Managers stopped thinking in terms of individuals with strengths and weaknesses and started thinking about headcount, both cost and allocation. Headcount became everything. With this change, attrition also spiked.

What used to take 1-2 engineers to spin up over a week or two now took months. We load tested our new services, but now we needed to make load tests repeatable and runnable in a runbook. Teams became extra defensive about features because the cost of every incident became so high. Nobody wanted to be the team that missed an integration test which came up as a cause of an incident. Our net reliability didn't actually change though the thinking was that repeatability would allow more seamless swapping of headcount. Program managers managed migrations and we started creating status meeting meetings which rolled up statuses of multiple ongoing initiatives cascading over multiple teams.

My own experience at the company went from having fun writing code and interacting with quirky folks at work to dealing with engineers who were dotting their i's and crossing their t's in every aspect of their job. The managers treated us as headcount and so headcount we became. It's been a highly depressing arc to a job I loved and where I built a lot of high-scale code, but perhaps the most frustrating part has been watching our velocity decrease despite our headcount ballooning, due to the overhead of programs, migrations, and incident management that developed their own bureaucracies. The saddest part is that the oldest parts of the company have become so essential and the bureaucracy so thick around them that replacing them has become really challenging. The parts of the company that were developed with the least care are the hardest to replace.

2 comments

Great comment, I think a lot of us can find something in it that we relate to, even if not the whole story.

> Nobody wanted to be the team that missed an integration test which came up as a cause of an incident. Our net reliability didn't actually change though

This caught my eye because I’ve seen similar phenomena as well. No matter how well-meaning those processes are, apps very rarely get significantly more reliable. It still seems to come down entirely to how complex the software is; it’s very hard to integration test your way to multiple 9s if it isn’t simple (some companies do it, but at bigger cost and slower velocity).

As for some other points…

I’ve come to believe, though, that a change in culture is inevitable past a certain team size. And that number for an engineering team is not even hundreds, but may be only dozens. You operate less as a team and more like an army, with all the personnel issues that come with it.

What interests me lately is the idea that, for many companies, 80% of the value/growth seems to be delivered when the company is relatively small at X headcount. Then increasing the headcount to 5X does not deliver much more than incremental changes built on top of the core product.

Now if you have that many employees you need to find things for them to do, like managing themselves, creating processes or going off on new market bets, and you risk losing focus.

But at the same time a company with 5X headcount seems more attractive to investors than one with X. The measure of success being only an IPO means there isn’t much pride in being well-run, small and focused, admitting that you’ll never be the top player in the market. That precipitates changes in developer culture.

> Somewhere along the ZIRP-age this all changed. We needed to become a Real Company (TM) now. We needed to have runbooks. We needed to have Boring Technology. We were poised for huge growth after all (aka our stock value shot through the roof.) The new managers we brought in needed to build out new teams to grow fast, fast, fast.

I agree. We should have had a cleansing cycle during the financial meltdown and cleared out all the madness. Instead money was printed and the party restarted with even more stupidity.