by ianamartin 2164 days ago
I love articles like this because it's so easy to just add that company to a list of places to never ever work.

I did read the whole article, btw. It's an absolute clickbait title that the author doesn't really mean, and after the article spends a lot of time defusing the clickbait title it really boils down to, "This is hard, so I give up."

It's true that many--if not most--companies operate this way without ever acknowledging it. And that's bad. It's also true that systems are harder to test than code. But it's not deep fucking magic. Look at the work aphyr does. Look at the testing work that the FoundationDB team did to prove their system's guarantees. Look at the work that security and devops people do every day. She is right that it is hard to test systems. So what? We don't get paid as much as we do because it's easy.

In a certain environment, it is truly impossible to test a system. That's when you have a dev culture that refuses to actually design knowable systems. A much better approach for the article would be to address exactly why systems are so hard to test rather than just saying fuck it. Everything she cites in her list of things that are hard to test is absolutely testable, if you have a knowable system. The real problem here is that agile/scrum/Extreme Programming practices inevitably and by principle do not result in knowable, testable systems. When you have 30+ agile teams on their own sprint cycles and product managers leaning on them to ship features and figure the rest out later, there can be no other result than fragile, broken, unknowable, untestable systems.

But the answer to that isn't "Everybody else is doing it so why can't I." The answer isn't to "embrace it." The answer isn't "This is hard, fuck it." The answer is most definitely not to make individual engineers pay the price of being on call because a company's culture and process are totally and completely hosed.

The answer is to address the problems in your company that caused this situation in the first place. The answer is to get your head out of the feature cult and the velocity war and reset your priorities. Systems aren't hard because your engineers suck. They're hard because companies suck. Systems are hard because in most places, no one is allowed to spend more than a couple minutes thinking about the systems.

Agile culture after your early startup cycle is a lot like being a 40-year-old guy who's 30 lbs overweight. How did this happen? How did I get here? I was just taking life one thing at a time and getting shit done. Now nothing works quite as well as it used to, it's harder to find dates, and everything just sort of hurts. Would anyone in their right mind just say, "Embrace it! Most 40-year-old tech dudes look about like you and are in the same situation! It's fine!" No. Of course not. You have to realize that your priorities have been totally broken for the last 15-20 years of your life, that you really weren't getting shit done, and you have to take some responsibility for your diet and get off your ass and exercise.

That's what companies have to do. They won't, of course. But they have to, otherwise they'll die young deaths. The author is totally correct when she recognizes a terrible symptom of unhealthy companies. But her treatment is hopelessly and tragically wrong.


This works great if you're building something with a tightly controlled API.

If, however, your configuration space grows to even a middling size, it is no longer feasible to do much of this validation across the configuration space. A good example is any system where the user can customize aspects of the system. Do you run all of your integration tests across the full configuration space?

Additionally, managing configuration skew between a dev and prod environment is not simple. Simply claiming that there should be no skew doesn't work. Often you want the prod and dev environments to run as different users, and you certainly want them to have different ACLs (your dev environment should not have access to your production database).

So you now have to, across your configuration space, validate that only the things that are "supposed" to be different differ, and that the things that aren't don't. Which maybe works for a while, but your prod configuration may also differ across parts of prod if, for example, a change is being canaried or incrementally deployed.
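To make that check concrete, here is a minimal sketch of it in Python. Everything in it is an assumption for illustration: the config keys, the values, the allowlist, and the function names are hypothetical and not taken from any real system.

```python
# Hypothetical sketch: compare two flat config dicts against an allowlist
# of keys that are expected to differ between dev and prod.

_MISSING = object()  # sentinel so "key absent" is distinct from "key is None"


def unexpected_skew(dev, prod, allowed_to_differ):
    """Keys that differ between dev and prod but are NOT supposed to."""
    keys = set(dev) | set(prod)
    return sorted(
        k for k in keys
        if dev.get(k, _MISSING) != prod.get(k, _MISSING)
        and k not in allowed_to_differ
    )


def missing_expected_differences(dev, prod, allowed_to_differ):
    """Keys that ARE supposed to differ but are identical in both."""
    return sorted(
        k for k in allowed_to_differ
        if dev.get(k, _MISSING) == prod.get(k, _MISSING)
    )


# Made-up example configs.
dev = {"db_host": "dev-db.internal", "run_as": "svc-dev", "timeout_s": 30, "retries": 3}
prod = {"db_host": "prod-db.internal", "run_as": "svc-prod", "timeout_s": 60, "retries": 3}
allowed = {"db_host", "run_as"}

print(unexpected_skew(dev, prod, allowed))               # ['timeout_s']
print(missing_expected_differences(dev, prod, allowed))  # []
```

The comparison is the easy part. What costs effort is keeping the allowlist correct as the configuration space grows.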

I've spent a non-trivial amount of effort on trying to solve the one problem of configuration skew between dev and prod for one real system. It's ultimately not worth it. "Fixing" it would be more work than leaving it alone. And I mean that in the long term: the effort to maintain and follow the rules that such a system would impose is greater than the effort of dealing with the annoyances of unintended skews.

Systems are hard because systems are hard. There's no good company that doesn't test/experiment in production. All of them do.

That's a weird point you are making.

Yes, I would test the basic/standard customizations a customer would make.

I would test the customization system itself.

If it is too complex, your customization system will bring you many more issues later on.

I didn't say that any of this was simple or trivial. Again, it depends on your priorities and your values as well as your company culture. In fact, I specifically said that testing systems is hard and provided examples of how hard systems are to test. Do you think that Cassandra is a tightly controlled API with a small configuration space?

You seem to feel like close enough is good enough. And that's the cause of the problem I'm trying to address here. Does it really matter if you don't get a notification when someone messages you on Facebook? Or if you get two notifications? Is that particular problem worth testing every possible Kafka configuration? I think what you are saying is no, it doesn't matter.

But I'm arguing a different point. I'm not arguing about whether the testability of any individual feature is important. For obvious reasons: some features really just aren't that important. But not being able to do that, and actively choosing not to understand the system, is a symptom of a far deeper problem. When a company makes the choice you have just described, the company has decided to accept that they can't, won't, and will never fully understand their own systems. It's often not a conscious decision. It's a decision made by habit, policy, and culture, which is what's so subversive about it. People don't make big-picture decisions to intentionally have a system that is unknowable/untestable. People make small decisions just like the ones you are talking about that make systems that way. And it's the practice of letting lots of disconnected people make the small decisions of what does and doesn't matter, and what is and isn't worth it, that destroys systems.

Systems are hard, and I agree with that, but systems are made even more so by bad process.

The being old analogy didn't seem to resonate with you, which is fine. But let me ask you a question about a system.

You have a database. It gets backed up every night. Or maybe every hour. Your job is to take snapshots and store them because that's what you're supposed to do. Yeah, I know, that should be or can be automated. Whatever.

The big picture system and purpose is that you are supposed to be able to recover from a hardware failure/data loss. But that's not your problem. Your problem is that you have to back up the database manually every day. The data team only tests restoring backups from dev to dev instead of prod to dev. Because reasons. Because it's hard.

That type of backup system checks all the boxes you're supposed to check when you get audited. Or at least enough to get through it. But when you really need to understand the system, it fails for all kinds of reasons and people are sitting around looking at each other saying, "well I did what I was supposed to do."

Individuals sitting around making isolated, disconnected decisions like the ones you're talking about (i.e., it just isn't worth it; it's not feasible; it's hard) compound in organizations and create the kinds of systems you don't want to deal with. You're making your own hell here. You seem to have missed that key point in my earlier comment.

Laziness is a good trait in an individual programmer. But laziness is the absolute death of an organization. Agile is really just distributed, organizational laziness. That's what creates horrible, unknowable systems.

Conflating test/experiment with what the original article claimed to be talking about (and then later walked back) is borderline disingenuous. No one is talking about A/B testing or intentional experiments.

The article is talking about rolling the dice in production deployments and claiming that's fine and something to be proud of. It isn't fine, and it's not something to be proud of. She's the CEO. She should fix her company instead of being proud of how bad it is.

A lot of what we're talking about here is a matter of perspective. And that is the problem I'm taking to task both with you and with the article.

> I think what you are saying is no, it doesn't matter.

No I'm not saying that. I'm saying that the best way to prevent that isn't always to have a staging environment that mirrors production as well as you can.

> Individuals sitting around making isolated, disconnected decisions like the ones you're talking about (i.e., it just isn't worth it; it's not feasible; it's hard) compound in organizations and create the kinds of systems you don't want to deal with. You're making your own hell here. You seem to have missed that key point in my earlier comment.

No, this was an intentional decision by the organization, that the organization shouldn't continue to invest time in solving the problem this way, because after significant effort expended by the organization, the conclusion of the people who the organization asked to investigate the problem was that solutions would not be feasible and would not improve things. You're acting like these decisions are always made in a vacuum. They're not. Often smart organizations investigate and make decisions at the level of leadership.

> Conflating test/experiment with what the original article claimed to be talking about (and then later walked back) is borderline disingenuous. No one is talking about A/B testing or intentional experiments.

Are you sure?

FTA:

> We conduct experiments in risk management every single day, often unconsciously. Every time you decide to merge to master or deploy to prod, you’re taking a risk.

> A healthy culture of experimentation and testing in production pulls together all three.

Canarying is just testing in production, but you have processes and "guardrails" (quoting the article) to make sure that it is done safely by default.
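As a rough illustration of what such a guardrail might look like, here is a minimal sketch in Python. The function name, the thresholds, and the three verdicts are all hypothetical choices for the example, not something the article or any particular deployment tool prescribes.

```python
# Hypothetical canary guardrail: compare the canary's error rate against the
# baseline fleet and decide whether to keep rolling out. All thresholds are
# made-up defaults for illustration.

def canary_verdict(baseline_errors, baseline_requests,
                   canary_errors, canary_requests,
                   max_ratio=2.0, min_requests=1000, rate_floor=0.001):
    # Not enough canary traffic yet to judge either way.
    if canary_requests < min_requests:
        return "wait"

    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    canary_rate = canary_errors / canary_requests

    # Allow the canary up to max_ratio times the baseline error rate, with a
    # small floor so a near-zero baseline doesn't make one error fatal.
    threshold = max(baseline_rate * max_ratio, rate_floor)
    return "rollback" if canary_rate > threshold else "promote"


print(canary_verdict(10, 10_000, 5, 500))     # wait
print(canary_verdict(10, 10_000, 1, 2_000))   # promote
print(canary_verdict(10, 10_000, 20, 2_000))  # rollback
```

The point is that the decision is explicit and automatic: the change still meets production traffic, but a bad one gets pulled by default instead of by whoever happens to be watching.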

For the record, I work primarily on reliability and release/experiment, and so I'm well aware that being explicit about your decisions is vital, as is knowing the tradeoffs involved. That's why pretending that you don't test in prod is a bad idea, because you almost assuredly do. That's what the article is saying.

Edit: As for Cassandra, it looks like they have system bugs caught in production, so I'm not sure what your point is (https://issues.apache.org/jira/projects/CASSANDRA/issues/CAS...)