| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by ianamartin 2164 days ago

I didn't say that any of this was simple or non-trivial. Again, it depends on your priorities and your values as well as your company culture. In fact, I specifically said that testing systems are hard and provided examples of how hard systems are to test. Do you think that Cassandra is a tightly controlled API with a small configuration space?

You seem to feel like close enough is good enough. And that's the cause of the problem I'm trying to address here. Does it really matter if you don't get a notification when someone messages you on Facebook? Or if you get two notifications? Is that particular problem worth testing every possible Kafka configuration? I think that you are saying is no, it doesn't matter.

But I'm arguing a different point. I'm not arguing about whether the testability of any individual feature is important. For obvious reasons: some features really just aren't that important. But not being able to do that, and actively choosing not to understand that system is a symptom of a far deeper problem. When a company makes the choice you have just described, the company has decided to accept that they can't, won't, and will never fully understand their own systems. It's often not a conscious decision, it's a decision made by habit, policy, and culture, which is what's so subversive about it. People don't make big-picture decisions to intentionally have a system that is unknowable/untestable. People make small decisions just like the ones you are talking about that make systems that way. And it's the practice of letting lots of disconnected people make the small decisions of what does and doesn't matter, what is and isn't worth it that destroys systems.

Systems are hard, and I agree with that, but systems are made even more so by bad process.

The being old analogy didn't seem to resonate with you, which is fine. But let me ask you a question about a system.

You have a database. It gets backed up every night. Or maybe every hour. Your job is to take snapshots and store them because that's what you're supposed to do. Yeah, I know, that should be or can be automated. Whatever.

The big picture system and purpose is that you are supposed to be able to recover from a hardware failure/data loss. But that's not your problem. Your problem is that you have to back up the database manually every day. The data team only tests restoring backups from dev to dev instead of prod to dev. Because reasons. Because it's hard.

That type of backup system checks all the boxes you're supposed to check when you get audited. Or at least enough to get through it. But when you really need to understand the system, it fails for all kinds of reasons and people are sitting around looking at each other saying, "well I did what I was supposed to do."

Individuals sitting around making isolated, disconnected decisions like the ones you're talking about (i.e., it just isn't worth it; it's not feasible; it's hard) compound in organizations and create the kinds of systems you don't want to deal with. You're making your own hell here. You seemed to have missed that key point in my earlier comment.

Laziness is a good trait in an individual programmer. But laziness is the absolute death of an organization. Agile is really just distributed, organizational laziness. That's what creates horrible, unknowable systems.

Conflating test/experiment with what the original article claimed to be talking about (and then later walked back) is borderline disingenuous. No one is talking about A/B testing or intentional experiments.

The article is talking about rolling the dice in production deployments and claiming that's fine and something to be proud of. It isn't fine, and it's not something to be proud of. She's the CEO. She should fix her company instead of being proud of how bad it is.

A lot of what we're talking about here is a matter of perspective. And that is the problem I'm taking to task both with you and with the article.

1 comments

joshuamorton 2163 days ago

> I think that you are saying is no, it doesn't matter.

No I'm not saying that. I'm saying that the best way to prevent that isn't always to have a staging environment that mirrors production as well as you can.

> Individuals sitting around making isolated, disconnected decisions like the ones you're talking about (i.e., it just isn't worth it; it's not feasible; it's hard) compound in organizations and create the kinds of systems you don't want to deal with. You're making your own hell here. You seemed to have missed that key point in my earlier comment.

No, this was an intentional decision by the organization, that the organization shouldn't continue to invest time in solving the problem this way, because after significant effort expended by the organization, the conclusion of the people who the organization asked to investigate the problem was that solutions would not be feasible and would not improve things. You're acting like these decisions are always made in a vacuum. They're not. Often smart organizations investigate and make decisions at the level of leadership.

> Conflating test/experiment with what the original article claimed to be talking about (and then later walked back) is borderline disingenuous. No one is talking about A/B testing or intentional experiments.

Are you sure?

FTA:

> We conduct experiments in risk management every single day, often unconsciously. Every time you decide to merge to master or deploy to prod, you’re taking a risk.

> A healthy culture of experimentation and testing in production pulls together all three.

Canarying is just testing in production, but you have processes and "guardrails" (quoting the article) to make sure that it is done safely by default.

For the record, I work primary on reliability and release/experiment, and so I'm well aware that being explicit about your decisions is vital, as is knowing the tradeoffs involved. That's why pretending that you don't test in prod is a bad idea, because you almost assuredly do. That's what the article is saying.

Edit: As for Cassandra, it looks like they have system bugs caught in production, so I'm not sure what your point is (https://issues.apache.org/jira/projects/CASSANDRA/issues/CAS...)