by joshuamorton 2164 days ago

This works great if you're building something with a tightly controlled API.

If, however, your configuration space grows to even a middling size, it is no longer feasible to do much of this validation across the whole space. A good example is any system where the user can customize aspects of the system. Do you run all of your integration tests across the full configuration space?
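To put a rough number on that, here is a back-of-the-envelope sketch. The 20 on/off settings and the one-minute suite are made-up figures for illustration, not taken from any real system.

```python
import math

# Hypothetical example: 20 independent on/off settings a user can customize.
option_counts = [2] * 20

# Each full configuration is one choice per setting, so the sizes multiply.
total_configs = math.prod(option_counts)

# Assumption: one minute per integration-suite run, which is optimistic.
minutes_per_run = 1
total_days = total_configs * minutes_per_run / (60 * 24)

print(f"{total_configs:,} configurations, about {total_days:,.0f} days of test time")
```

Even with these small numbers, exhaustive coverage is out of reach, which is why teams usually end up sampling the space or testing only the common combinations.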

Additionally, managing configuration skew between a dev and prod environment is not simple. Simply claiming that there should be no skew doesn't work. Often you want the prod and dev environments to run as different users, and you certainly want them to have different ACLs (your dev environment should not have access to your production database).

So you now have to, across your configuration space, validate that only the things that are "supposed" to be different differ, and that the things that aren't don't. Which maybe works for a while, but your prod configuration may also differ across parts of prod if, for example, a change is being canaried or incrementally deployed.
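A minimal sketch of what such a check could look like, assuming configs are nested dictionaries and that someone maintains an explicit allowlist of keys that may differ. The function names, keys, and values here are all hypothetical.

```python
from typing import Any

# Sentinel for "this key does not exist in that environment".
MISSING = object()


def flatten(cfg: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted keys: {"db": {"host": "x"}} -> {"db.host": "x"}."""
    flat: dict[str, Any] = {}
    for key, value in cfg.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def unintended_skew(
    dev: dict[str, Any], prod: dict[str, Any], allowed_to_differ: set[str]
) -> dict[str, tuple[Any, Any]]:
    """Return every key whose value differs between dev and prod but is not allowlisted."""
    dev_flat, prod_flat = flatten(dev), flatten(prod)
    skew: dict[str, tuple[Any, Any]] = {}
    for key in sorted(dev_flat.keys() | prod_flat.keys()):
        dev_value = dev_flat.get(key, MISSING)
        prod_value = prod_flat.get(key, MISSING)
        if dev_value != prod_value and key not in allowed_to_differ:
            skew[key] = (dev_value, prod_value)
    return skew


# Hypothetical configs: the user and DB host are supposed to differ, the pool size is not.
dev = {"run_as": "dev-svc", "db": {"host": "dev-db", "pool_size": 10}, "timeout_s": 30}
prod = {"run_as": "prod-svc", "db": {"host": "prod-db", "pool_size": 50}, "timeout_s": 30}
allowed = {"run_as", "db.host"}

print(unintended_skew(dev, prod, allowed))
```

The check itself is the easy part. The ongoing cost is keeping the allowlist accurate, and a single `prod` dictionary stops being meaningful once a canary or incremental rollout means different parts of prod run different configs.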

I've spent a non-trivial amount of effort trying to solve the single problem of configuration skew between dev and prod for one real system. It's ultimately not worth it. "Fixing" it would be more work than leaving it alone. And I mean that in the long term: maintaining and following the rules that such a system would impose takes more effort than dealing with the annoyances of unintended skew.

Systems are hard because systems are hard. There's no good company that doesn't test/experiment in production. All of them do.

2 comments

hampfi 2163 days ago

That's a weird point you are making.

Yes, I would test the basic / standard customizations a customer would make.

I would test the customization system itself.

If it is too complex, your customization system will bring you many more issues later on.


ianamartin 2164 days ago

I didn't say that any of this was simple or trivial. Again, it depends on your priorities and your values as well as your company culture. In fact, I specifically said that testing systems is hard and provided examples of how hard systems are to test. Do you think that Cassandra is a tightly controlled API with a small configuration space?

You seem to feel like close enough is good enough. And that's the cause of the problem I'm trying to address here. Does it really matter if you don't get a notification when someone messages you on Facebook? Or if you get two notifications? Is that particular problem worth testing every possible Kafka configuration? I think what you are saying is no, it doesn't matter.

But I'm arguing a different point. I'm not arguing about whether the testability of any individual feature is important. For obvious reasons: some features really just aren't that important. But not being able to do that, and actively choosing not to understand that system, is a symptom of a far deeper problem. When a company makes the choice you have just described, the company has decided to accept that they can't, won't, and will never fully understand their own systems. It's often not a conscious decision; it's a decision made by habit, policy, and culture, which is what's so subversive about it. People don't make big-picture decisions to intentionally have a system that is unknowable/untestable. People make small decisions just like the ones you are talking about that make systems that way. And it's the practice of letting lots of disconnected people make the small decisions of what does and doesn't matter, what is and isn't worth it, that destroys systems.

Systems are hard, and I agree with that, but systems are made even more so by bad process.

The being old analogy didn't seem to resonate with you, which is fine. But let me ask you a question about a system.

You have a database. It gets backed up every night. Or maybe every hour. Your job is to take snapshots and store them because that's what you're supposed to do. Yeah, I know, that should be or can be automated. Whatever.

The big picture system and purpose is that you are supposed to be able to recover from a hardware failure/data loss. But that's not your problem. Your problem is that you have to back up the database manually every day. The data team only tests restoring backups from dev to dev instead of prod to dev. Because reasons. Because it's hard.

That type of backup system checks all the boxes you're supposed to check when you get audited. Or at least enough to get through it. But when you really need to understand the system, it fails for all kinds of reasons and people are sitting around looking at each other saying, "well I did what I was supposed to do."

Individuals sitting around making isolated, disconnected decisions like the ones you're talking about (i.e., it just isn't worth it; it's not feasible; it's hard) compound in organizations and create the kinds of systems you don't want to deal with. You're making your own hell here. You seemed to have missed that key point in my earlier comment.

Laziness is a good trait in an individual programmer. But laziness is the absolute death of an organization. Agile is really just distributed, organizational laziness. That's what creates horrible, unknowable systems.

Conflating test/experiment with what the original article claimed to be talking about (and then later walked back) is borderline disingenuous. No one is talking about A/B testing or intentional experiments.

The article is talking about rolling the dice in production deployments and claiming that's fine and something to be proud of. It isn't fine, and it's not something to be proud of. She's the CEO. She should fix her company instead of being proud of how bad it is.

A lot of what we're talking about here is a matter of perspective. And that is the problem I'm taking to task both with you and with the article.


joshuamorton 2164 days ago

> I think what you are saying is no, it doesn't matter.

No, I'm not saying that. I'm saying that the best way to prevent that isn't always to have a staging environment that mirrors production as well as you can.

> Individuals sitting around making isolated, disconnected decisions like the ones you're talking about (i.e., it just isn't worth it; it's not feasible; it's hard) compound in organizations and create the kinds of systems you don't want to deal with. You're making your own hell here. You seemed to have missed that key point in my earlier comment.

No, this was an intentional decision by the organization, that the organization shouldn't continue to invest time in solving the problem this way, because after significant effort expended by the organization, the conclusion of the people who the organization asked to investigate the problem was that solutions would not be feasible and would not improve things. You're acting like these decisions are always made in a vacuum. They're not. Often smart organizations investigate and make decisions at the level of leadership.

> Conflating test/experiment with what the original article claimed to be talking about (and then later walked back) is borderline disingenuous. No one is talking about A/B testing or intentional experiments.

Are you sure?

FTA:

> We conduct experiments in risk management every single day, often unconsciously. Every time you decide to merge to master or deploy to prod, you’re taking a risk.

> A healthy culture of experimentation and testing in production pulls together all three.

Canarying is just testing in production, but you have processes and "guardrails" (quoting the article) to make sure that it is done safely by default.
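As a rough illustration of what a safe-by-default guardrail might look like, here is a minimal sketch of a canary gate. The thresholds, names, and traffic numbers are hypothetical; real release systems look at many more signals than error rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stats:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Treat "no traffic yet" as a zero error rate rather than dividing by zero.
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(
    baseline: Stats,
    canary: Stats,
    min_requests: int = 500,          # assumption: minimum traffic before judging
    max_abs_increase: float = 0.01,   # assumption: tolerated error-rate increase
) -> str:
    """Return "wait", "rollback", or "promote" for a canary compared to the baseline."""
    if canary.requests < min_requests:
        return "wait"  # not enough data; keep the canary small and keep watching
    if canary.error_rate - baseline.error_rate > max_abs_increase:
        return "rollback"  # the default when the canary looks worse is to back out
    return "promote"


baseline = Stats(requests=10_000, errors=50)
print(canary_decision(baseline, Stats(requests=100, errors=0)))
print(canary_decision(baseline, Stats(requests=1_000, errors=40)))
print(canary_decision(baseline, Stats(requests=1_000, errors=6)))
```

The point of the sketch is that the change still meets real production traffic. The guardrail only limits how much traffic it sees and makes rollback the automatic response when it misbehaves.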

For the record, I work primarily on reliability and release/experiment, and so I'm well aware that being explicit about your decisions is vital, as is knowing the tradeoffs involved. That's why pretending that you don't test in prod is a bad idea, because you almost assuredly do. That's what the article is saying.

Edit: As for Cassandra, it looks like they have system bugs caught in production, so I'm not sure what your point is (https://issues.apache.org/jira/projects/CASSANDRA/issues/CAS...)
