Hacker News new | ask | show | jobs
by bob1029 1883 days ago
I love the test-in-production illustration. This is a fairly accurate map of my journey as well. I vividly remember scolding our customers about how they needed a staging environment that was a perfect copy of production so we could guarantee a push to prod would be perfect every time.

We still have that customer but we have learned a very valuable & expensive lesson.

Being able to test in production is one of the most wonderful things you can ever hope to achieve in the infra arena. No ambiguity about what will happen at 2am while everyone is asleep. Someone physically picked up a device and tested it for real and said "yep its working fine". This is much more confidence inspiring for us.

I used to go into technical calls with customers hoping to assuage their fears with how meticulous we would be with a test=>staging=>production release cycle. Now, I go in and literally just drop the "We prefer to test in production" bomb on them within the first 3 sentences of that conversation. You would probably not be surprised to learn that some of our more "experienced" customers enthusiastically agree with this preference right away.

3 comments

You can test in production by moving fast and breaking things (the clueless guy).

You can test in production by having canaries, filters, etc, and allowing some production traffic to the version under test. This is the "wired" guy.

For many backend things, you can test in production by shadowing the current services, and copying the traffic into your services under test. It has limitations, but can produce zero disruption.

Until you discover that prod has an edgecase string that has the blood mother of cthulu's last name written in swahili-hebrew, (you know it as swabrew, or SHE for short) which due to a lack of goat blood on migration day is one of the ~3% of edge cases that aren't replicated and now youve got an entirly new underground civilization born that is expecting you to serve them traffic with 5 9's and you can only do 4 because of the flakiness.
IMO this still keeps production a previous thing. Next comes multi-version concurrent environments on the same replicated data streams. There is only production and delivery to customers is access version 3 instead of version 2. This can be proceeded by an automated comparison of v2's outputs to v3's and a review (or code description of the expected differences).
you still have unit tests and similar right? and maybe test in a staging env lightly first? or, right to Live?
Yes we have a 2 stage process with the customer. There is still a test & production environment, but test talks to all production business systems too. Only a few people are allowed into this environment. We test with 1-5 users for a few days then decide to take it to the broader user base.
> but test talks to all production business systems too

I'm not sure I understand this, would you mind explaining more? Do you mean you have multi-tenancy in databases and application (the "tenants" being stage/test and prod)?

I think what op’s referring to is a staging app version that talks to production services and databases. Eg you have a “clone” of your production UI, accessible only to devs. This clone is configured to talk to the same DB and call the same dependencies as the production service but since it’s access is limited it’s used by devs to test their new feature(s).

This pattern is used where I work too. It’s been incredibly powerful. Some of our services allow for multiple staging replicas so that nobody is blocking one another; the replicas are ephemeral and nuked after testing.

Right but I'm wondering how you avoid polluting your production dataset with test data, hence the question about multi-tenancy in the database. Do you use a different schema? Discriminator column? How do you know the data came from the "clone"?
Yes. This is essentially the idea.
This is how to build a scaling DoS issue that bombs prod iirc