I have a lot of bitter things to say about automated testing, having spent 14 years of my life trying to knead it into a legitimate profession, but here's the most significant:
Your test case is more useless than a turd in the middle of the dining room table unless you put a comment in front of it that explains what it assumes, what it attempts, and what you expect to happen as a result.
Because if you just throw in some code, you're only giving the poor bastard investigating it two puzzles to debug instead of one.
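A minimal sketch of what such a comment might look like, in Python. The function `normalize_username` is made up purely for illustration; the point is the three-part comment, not the code under test.

```python
def normalize_username(raw: str) -> str:
    """Hypothetical function under test (not from any real project)."""
    return raw.strip().lower()


def test_normalize_username_ignores_whitespace_and_case():
    # Assumes:  usernames are stored lowercase and users often paste them
    #           with stray spaces around them.
    # Attempts: normalizing "  Alice " (surrounding spaces, mixed case).
    # Expects:  "alice", so the result matches the stored name.
    assert normalize_username("  Alice ") == "alice"
```

Whoever investigates a failure here can tell from the comment alone whether the test or the code is the thing that is wrong.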
At an old job, one manager would put in his employees' annual reports stuff like "Developer X wrote N automated tests, fixed M bugs, and filed P new bugs this quarter..."
The obvious result of Goodhart's Law ensued, leading to test cases like you mention.
Lesson to leaders: Please stop your bad managers from pulling stupid crap like this. It wastes a lot more time in the longer run.
Which is funny as the purpose of testing is to explain to other developers what the code under test assumes and what should be expected of it under various conditions. It is documentation.
If you have to document your documentation, you might be missing something fundamental in how you are writing your first order documentation. Not to mention that in doing so you defeat the reason for writing your documentation in an executable form (to be able to automatically validate that the documentation is true).
So do I understand correctly that your position is "code is the documentation"?
Over time I'm inclined to value human-written documentation, especially when things involve integrations of multiple systems. I have had real cases where two parties point at their code and say it is correct. And in isolation the code looks correct. But when the time comes to integrate these systems, it breaks. And then if you have a human-readable document where intentions and expectations are specified, it's much easier to come to a common (working) solution.
Not all languages have the capability to express complex intentions, so code as documentation does not work most of the time.
Code as documentation feels like a good idea because code is the only reliable source of truth. But it also assumes that code can comprehensively express all assumptions and other info, which sounds more like wishful thinking.
Auto-generated API docs combined with handwritten documentation that covers what can't be expressed in code and includes some useful examples seems like the right approach to me. In practice that's the kind of doc I tend to have the best experience with. For example, the Rust stdlib docs are auto-generated, but the language also supports notes and (automatically unit-tested) examples in docstrings, which means the API docs are filled with explanations & examples and mention what assumptions are made about inputs.
I built this framework coz while I didn't believe in "code as documentation" I did believe that example based specifications, tests and documentation were all sort of the same thing (triality):
The difference between this and behave/cucumber is that the A) specification language allows for more complex representations and B) there's a templating step to generate readable documentation.
I'm not sure if you're saying that rust stdlib docs do this but documentation where all the examples are themselves runnable as tests and included in the CI test suite solves so many problems.
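For what it's worth, Python's stdlib `doctest` module does roughly this: examples in a docstring are executed and their output compared. A minimal sketch, where `clamp` and `check_examples` are names invented for illustration:

```python
import doctest


def clamp(value, low, high):
    """Limit value to the inclusive range [low, high].

    Assumes low <= high; that assumption is not checked.

    >>> clamp(5, 0, 10)
    5
    >>> clamp(-3, 0, 10)
    0
    >>> clamp(99, 0, 10)
    10
    """
    return max(low, min(value, high))


def check_examples(func):
    """Run the examples in func's docstring; return doctest's TestResults."""
    parser = doctest.DocTestParser()
    # The globs dict makes the function visible to its own examples.
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)
```

Calling `check_examples(clamp)` from a CI job would fail the build as soon as the prose examples and the implementation disagree.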
Same. I’m sick of people escaping writing documentation by saying that "code is the doc" and in the meantime, writing unreadable code abstracted over dozens of code files.
They almost convinced me somewhere in my career. But the hard truth I learnt is that most people are saying this because they aren’t capable of verbalizing what they are programming.
If your "code is doc", it should be extremely easy to add a little sentence above your method to explain what it does. And no, doc doesn’t go stale. If your documentation doesn’t match what your function does, it’s probably because you should have written a brand new function instead of changing a function’s behavior.
The assumption in this that doesn’t fit my experience is that it assumes that someone that writes unreadable code abstracted over dozens of files, is going to be able to write clear, expressive and complete documentation.
In my experience if they can do the latter the former isn’t a problem. But since many people can’t you are left with bad code, littered with bad (often contradictory) comments which makes the problem worse not better.
> But the hard truth I learnt is that most people are saying this because they aren’t capable of verbalizing what they are programming.
I completely agree with you. Right now I'm writing a bunch of data migration code that looks like an awful 200 lines at first glance, but it does quite clever transformations, handles various data corner cases, manages lots of threads, is already quite optimized (a 30x speed increase over last week's state, and I'm not done with it yet) etc. and... it is full of little green one-liners explaining why certain logic is happening, why at that place and not elsewhere, and how it helps later in the code.
It's even a one-off migration, and it's mostly for me only. But I still put comments in: I have enough experience to know I will keep using those comments in further optimizations, and I know by heart that many one-off efforts end up being re-used later. Code dense with logic shouldn't require you to re-read it all to keep a constant full mental model of it and all its branching and possibilities just because you want to tweak it a bit.
The important point is to evolve those comments with code, otherwise they become worse than no comments at all. This is where most folks hit the wall - they are simply too lazy or undisciplined for that.
> They almost convinced me somewhere in my career. But the hard truth I learnt is that most people are saying this because they aren’t capable of verbalizing what they are programming.
I'd say that's true, and it's worth noting at this point that expressing certain things in natural language is hard. The strict rules of programming languages mean that you can reason about programs to a complexity level that would otherwise be unreachable. Notation as a tool of thought. The corollary is that there may not be a simple natural language equivalent of the code you're writing, and that adding documentation might be more effort than it's worth.
Code is documentation, but it only tells you a part of the story. Good comments can explain why, but without writing comment essays it's usually not sufficient.
And, as you note, when integrating systems you need more than just the code and comments, since the code might not even be written with the other system in mind.
I think they mean "test code is documentation". For example if there's a unit test that expects an error for a certain input, it serves as documentation that this kind of input is not allowed.
It's not always feasible to document every little edge case in natural language and keep it in sync with your code. If you "document" edge cases as tests, they _have_ to be in sync with your code. It shouldn't replace traditional documentation though and is better suited for internal components and not for public API.
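A small sketch of an edge case "documented" as a test, using Python's stdlib `unittest`. `parse_port` is a hypothetical function invented for this example:

```python
import unittest


def parse_port(text: str) -> int:
    """Hypothetical function under test: parse a TCP port number from text."""
    port = int(text)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


class ParsePortEdgeCases(unittest.TestCase):
    # Each test name states one rule about which inputs are allowed.

    def test_port_zero_is_not_allowed(self):
        with self.assertRaises(ValueError):
            parse_port("0")

    def test_port_above_65535_is_not_allowed(self):
        with self.assertRaises(ValueError):
            parse_port("65536")

    def test_highest_valid_port_is_accepted(self):
        self.assertEqual(parse_port("65535"), 65535)
```

If someone later loosens the range check, the first two tests fail, so this piece of "documentation" cannot silently drift from the code.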
No. Your documentation is your documentation. It documents your code. It is not your code.
If the documentation can also be interpreted by machine to validate that what it claims is true, you have a nice side benefit, but that is not the reason for writing your documentation.
Disregarding the "code is doc" position, it's still common to have an overview or index for documentation, which points readers in the right direction instead of dumping pages of detailed docs on them.
Now, you could also have a well organized test suite that goes from most obvious to most detailed, split into sections for each use-case, but this sounds a lot more tedious than "write a one-line comment describing the unit test".
>the purpose of testing is to explain to other developers what the code under test assumes and what should be expected of it under various conditions
No, the point of automated testing is to verify that what is under test behaves correctly and to be able to scale this verification cheaper than having humans do it. Documenting what it verifies and under what conditions is just a side effect.
That is a common falsehood. Testing does not verify that the code under test behaves correctly. It only verifies that what the documentation asserts correctly matches what the code does. Indeed, enabling the machine to verify that the documentation is true is cheaper than having humans do it. Also less error prone. Humans are notoriously bad at keeping documentation properly up to date.
>Your test case is more useless than a turd in the middle of the dining room table unless you put a comment in front of it that explains what it assumes, what it attempts, and what you expect to happen as a result.
This is why I found Gherkin/Cucumber (and BDD in general) to be a total revelation when I first encountered it. No one should be writing tests any other way IMO.
Gherkin/Cucumber reintroduce the very problem TDD/BDD was intended to solve: Documentation falling out of sync with the implementation.
The revelation of TDD, which was later rebranded as BDD to deal with the confusion that arose with other types of testing, was that if your documentation was also executable the machine could be used to prove that the documentation is true. Gherkin/Cucumber specifications themselves are not executable and require you to re-document the function in another language, with no facilities to ensure that the two are consistent with each other.
If you are attentive enough to ensure that the documentation and the implementation are aligned, you may as well write it in plain English. It will give you all of the same benefits without the annoying syntax.
Unit tests assert implementation behaviour to aid refactoring. If developers misunderstand the spec, the unit tests can be valid. They don't assert developer understanding.
Say it with me, unit tests are to aid refactoring.
If we mix QA and implementation details just because both sides use the word "test" it ends in trouble.
QA should be blind to unit test coverage or even usage at all, they're totally independent concerns.
A passing unit test says nothing against correctness of product against a spec or design... only that it works and continues to work as a developer intended, to aid the work of future developers, even if they misunderstood the spec.
Your comment is at the core of why QA is a total mess. Everyone is confused about what "test" means in different contexts.
Why have a QA function at all with 100% unit test coverage? Because the unit tests may encode misunderstanding by developers. They're there to fight entropy, not wrongness.
QA, using BDD and other tools, ensures the product is correct, regardless of how well it fights entropy by unit tests.
This sounds like a good theory but the practice of it is really hard. Pretty quickly you end up with tests that "say" one thing but have nuanced different behavior in the underlying implementation.
Then try to debug a "document"...
I like the idea. But having tried it at scale, it becomes a mess. Code I can understand. I can read English comments. I can't debug English.
I know what typical code does. This code looks simple but that's misleading when you're trying to understand a failure. You want consistency and clarity. You want readability like code is readable, not like a book is readable.
I agree. One nice feature of property-driven testing is that assumptions often end up causing test failures. For example (in ScalaTest):
"Average of list" should "be within range" in {
  forAll() {
    (l: List[Float]) => {
      // `average` is assumed to be a helper defined elsewhere; it is not a standard List method
      val avg = l.average
      assert(avg >= l.min && avg <= l.max)
    }
  }
}
This test will fail, since it doesn't hold for e.g. empty lists. Requiring non-empty lists will still fail, if we have awkward values like NaNs, etc. The following version has a better chance of passing:
"Average of list" should "be within range" in {
  forAll() {
    (raw: List[Float]) => {
      // Make the assumptions explicit: drop NaNs and infinities, skip empty lists
      val l = raw.filter(n => !n.isNaN && !n.isInfinite)
      whenever (l.nonEmpty) {
        val avg = l.average
        assert(avg >= l.min && avg <= l.max)
      }
    }
  }
}
Getting this test to pass required us to make those assumptions explicit. Of course, it doesn't spot everything; here's an article which explores this example in more depth (in Python) https://hypothesis.works/articles/calculating-the-mean
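For comparison, here is a hand-rolled sketch of the same property in Python using only the stdlib. It is not Hypothesis, and the input generator and its ranges are arbitrary choices of mine. It uses `statistics.mean`, which in CPython sums exactly before rounding; a naive `sum(values) / len(values)` can violate the property through rounding or overflow.

```python
import math
import random
import statistics


def check_mean_within_range(trials: int = 1000, seed: int = 0) -> None:
    """Check min <= mean <= max on random lists, with assumptions explicit."""
    rng = random.Random(seed)
    for _ in range(trials):
        size = rng.randint(0, 20)
        raw = [
            rng.choice([rng.uniform(-1e9, 1e9), math.nan, math.inf, -math.inf])
            for _ in range(size)
        ]
        # Assumption 1: NaNs and infinities are filtered out.
        values = [x for x in raw if math.isfinite(x)]
        # Assumption 2: the mean of an empty list is undefined, so skip it.
        if not values:
            continue
        avg = statistics.mean(values)
        assert min(values) <= avg <= max(values), values
```

Removing either assumption makes the check fail or raise, which is exactly the kind of hidden precondition the property test forces into the open.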
We have a policy of making each test a spec. That is, a test requires a plain text spec to be attached to it in its doc string. It's kind of like BDD but without all the weird DSLs.
Which would replace all those humans producing perfectly valid sounding explanations that if you invest some research effort have no basis in (the usually far more complex, but also far more fascinating and infinitely deep) reality. So yes, I think AI can indeed replace lots of human-produced thoughts :-)
I admit to having been guilty of this myself. I have a famous anecdote-example where I had a very well-paid contractor job and explained something about how my then department's software worked to someone from another department. I think I must have sounded very convincing; the person went off to change something in how they used our stuff. A few minutes later, after accidentally meeting and casually chatting with my boss for that job, I realized everything I had said was total garbage. I quickly excused myself from my boss and hurried after the person to tell them to forget and ignore everything I had just explained to them because it was all wrong. I think this last step is not what usually happens in those cases, because we don't usually realize that such a thing just happened.
The brain, or parts of it, are great at producing "explanations". I think that it was part of the more established and reproducible results of psychology that our brain first decides and acts, and only then produces some (often bullshit) "reason" when/if our conscious self asks for one? Does anybody remember if this is true and has a link?
>The brain, or parts of it, are great at producing "explanations". I think that it was part of the more established and reproducible results of psychology that our brain first decides and acts, and only then produces some (often bullshit) "reason" when/if our conscious self asks for one? Does anybody remember if this is true and has a link?
Relevant are Sperry & Gazzaniga's split brain experiments. Participants of these experiments had had their corpus callosum (one of the major "information" pathways between our brain's two halves) cut. This was an operation performed to keep epileptic seizures in check.
In these participants, specific brain "functions" such as speech were highly lateralized, meaning only one half of the brain was able to perform it to a satisfying degree.
Note that these were already not neuro-typical people prior to the experiments (given the regular, debilitating epileptic seizures), so reaching general conclusions from these experiments is hard.
Remember also that, like our brains, our bodies are also highly lateralized, such that the right half of our brain controls the left side of our body, and the left half of the brain controls the right side of our body. If you ever wanted proof against intelligent design, the way our brain connects to our eyes & body is one very strong argument.
Anyway, one experiment comes to mind where one half of the brain was instructed to perform some action (move the left arm, or something similar). Then the other half would be asked _why_ that arm was just moved. It would confabulate, on the spot, totally legit-sounding but obviously bullshit reasoning. E.g. "I felt cold so I wanted to put on a coat", rather than "the experimenter instructed me to move it".
So, rather than claiming "I don't know", it would just make up a plausible reason. It is really unimaginable to _not_ know why you moved your arm.
> During day-to-day development, the important bit isn't that there are no failures. The important bit is that there are no regressions.
And that's why we test and why tests shouldn't be allowed to fail.
Just because the scenarios described make testing hard does not change reality of what makes tests valuable.
If pre-existing failures are halting the production pipeline and you don't like it, switch off trunk based development and see if you like the waits and constant rebasing in large projects/teams. But don't eff with the bloody tests!
When the codebase gets large enough you need to allow some tests to "fail", but really I mean you need a way to quickly mark a failing test as flakey so the author can fix it while everyone else can get on with their day and merge code.
At $dayjob this works well, if your CI comes up red with some unrelated test failing, you can mark the test as flakey in the UI and CI will allow your code to merge and a Jira ticket will be created for the test owner to fix their test (and it will be disabled for future test runs)
I think for small to medium projects, you can have all tests succeed but once the repo is large enough / has frequent enough changes, flakey tests are bound to slip in.
Our setup just reruns the tests a few times which sorts out flaky tests. The page then shows the most frequently failing tests so they can be properly fixed.
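A rough sketch of that rerun-and-rank idea in Python. The function name, the retry count and the "a test is a callable that raises AssertionError" convention are all assumptions for illustration, not our actual setup.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple


def run_with_retries(
    tests: Dict[str, Callable[[], None]], attempts: int = 3
) -> Tuple[Dict[str, bool], List[Tuple[str, int]]]:
    """Run each test up to `attempts` times; count every failed attempt."""
    failures: Counter = Counter()
    passed: Dict[str, bool] = {}
    for name, test in tests.items():
        ok = False
        for _ in range(attempts):
            try:
                test()
                ok = True
                break  # one green run is enough to let the build continue
            except AssertionError:
                failures[name] += 1
        passed[name] = ok
    # Most frequently failing tests first, so they can be properly fixed.
    return passed, failures.most_common()
```

The second return value is what would feed a "most frequently failing tests" page; a test that only passes on a retry still shows up there.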
I think GitHub does something similar - public website tests must always pass but if you break GitHub Enterprise you get three days to fix it (or something like that - I think they had a blog post on it).
If testing that way is painful (and it is), then work with people to remove the pain. Tests are supposed to help developers, not constrain or punish them.
Put tests in the same repo as the SUT. Do more testing closer to the code (more service and component tests) and do less end-to-end testing. Ban "flakey" tests - they burn engineering time for questionable payoff.
Test failures can be thought of as "things developers should investigate." Make sure the tests are focused on telling you about those things as fast as possible.
Also, take the human out of the "wait for green, then submit PR" steps. Open a PR but don't alert everyone else about it until you run green, maybe?
It would work for most "classical" software development. In this case, the author talks about conformance tests (a HUGE collection) from an external vendor. Most of them will fail at first, then you make them pass slowly but steadily.
The problem becomes: I want to know if there are significant regressions in the vendor tests, i.e. tests that were green for a long time and suddenly changed. You could flag any test that became green at some point as "required" to pass the CI, but then you have tests that randomly succeed or fail depending on code you have not yet written (e.g. locking around concurrent structures). Marking these tests manually is impractical and could definitely be replaced by tooling that supports some statistical modeling of success/failure.
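A very crude sketch of what such tooling might compute, in Python. The window size, the threshold and the category names are arbitrary assumptions; real statistical modeling would be more careful than a pass rate over a fixed window.

```python
from typing import Sequence


def classify_failure(
    history: Sequence[bool], min_runs: int = 20, stable_rate: float = 0.95
) -> str:
    """Classify a failure that just happened, given past outcomes.

    `history` lists earlier results, oldest first, True meaning "passed".
    """
    if len(history) < min_runs:
        return "unknown"  # not enough data to say anything
    recent = history[-min_runs:]
    pass_rate = sum(recent) / len(recent)
    if pass_rate >= stable_rate:
        return "regression"  # was reliably green, now red: block the merge
    if pass_rate == 0:
        return "not-yet-passing"  # never worked, probably unimplemented
    return "flaky"  # passes and fails without a clear pattern
```

Only the "regression" category would need to block a merge; the other two could just be reported.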
You may have the best testing strategy for internal code but as long as you have to test against these conformance tests it's simply unfeasible to say "sorry, only green allowed".
> take the human out of the "wait for green, then submit PR"
It'd be great if GitHub could open a PR for reviews (aka un-draft) automatically after CI succeeds. (If not in the core product, is there a bot that does that?)
My company uses a workflow where we don't use PRs for code reviews. Instead we each have our own git repo that's a fork of the tech lead's, with some git rules in place to impose a branch namespace. To open a review request you push a branch into the reviewer's repository. Our CI system detects the new branch and starts running it. Once CI passes that updates the bug tracker which triggers a notification to the reviewer.
The reviewer then does a git fetch, and then checks out the newly created rr/ branch. They make any small changes that aren't worth a roundtrip and push them to the rr branch. They add FIXME comments for bigger changes. They then either assign the ticket back to the developer, or go ahead and merge straight into their own dev branch. Once an rr branch is merged it's simply deleted. The dev branch is then pushed and CI will merge it to that user's master when it's green.
IntelliJ will show branches in each origin organized by "folder" if you use forward slashes in branch names, and gitolite (which is what we use to run our repos) can impose ACLs by branch name too. So for example only user alice can push to a branch named rr/alice/whatever in each person's repo. That ensures it's always clear where a PR/RR is coming from.
Because each user gets their own git repo and cloned set of individual CI builds, you can push experimental or WIP branches to your personal area and iterate there without bothering other people.
This workflow gets rid of things like draft PRs (which are a contradiction), it ensures each reviewer has a personal review queue, it means work and progress is tracked via the bug tracker (which understands commands in commit messages so you can mark bugs as fixed when they clear CI automatically) and it eliminates the practice of requesting dozens of tiny changes that'd be faster for the reviewer to apply themselves, because reviewer and task owner can trade commits on the rr branch using git's features to keep it all organized and mergeable.
Seems to me like you're underinvesting in tooling. It's a mistake a lot of development shops make - you focus on your product, so you can't spend time building something completely orthogonal, but in the process you end up wasting man-years on a broken PR process, instead of spending a month early on building some tooling that would have removed the pain in the first place.
The continuous testing is something I’ve thought about and it’s a tricky one. We use property tests[1] so here’s a quick stab at how I’d like it to look:
Test starts failing, immediately send a report with the failing input, then continue with the test case minimisation and send another report when that finishes.
Concurrently, start up another long running process to look for other failures, skipping the input that caused the previous failure. We do want new inputs for the same failure though. This is the tricky one. We could probably make it work by having the prop test framework not reuse previously-failing inputs, but that’s one of the big strategies it uses to catch regressions.
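A rough Python sketch of the first half of that (report immediately, then minimise and report again). It is hand-rolled with the stdlib rather than built on a real property-testing framework; the input generator, the greedy shrinker and the `report` callback are all assumptions for illustration, and it does not attempt the tricky concurrent search.

```python
import random
from typing import Callable, List, Optional

Property = Callable[[List[int]], bool]


def find_failure(prop: Property, rng: random.Random, trials: int = 500) -> Optional[List[int]]:
    """Return the first random input for which the property is false."""
    for _ in range(trials):
        candidate = [rng.randint(-100, 100) for _ in range(rng.randint(0, 10))]
        if not prop(candidate):
            return candidate
    return None


def shrink(prop: Property, failing: List[int]) -> List[int]:
    """Greedily drop single elements while the property keeps failing."""
    current = list(failing)
    improved = True
    while improved:
        improved = False
        for i in range(len(current)):
            smaller = current[:i] + current[i + 1:]
            if not prop(smaller):
                current = smaller
                improved = True
                break
    return current


def run_property(prop: Property, report: Callable[[str, List[int]], None], seed: int = 0) -> None:
    failing = find_failure(prop, random.Random(seed))
    if failing is None:
        return
    report("first failure", failing)  # sent before minimisation starts
    report("minimised", shrink(prop, failing))
```

In a real setup `report` would post to wherever failures are tracked, so people can start looking at the raw input while minimisation is still running.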
> The above development practice works well when the SUT and TB are both defined by the same code repository and are developed together.
I once witnessed a team creating an app, specs and tests in three respective repositories. For no other reason than "each project should be in its own repository".
The added work/maintenance around that is crazy, for absolutely no gain in that case.
Phase 1. Code and test basic functions concerning any kind of arithmetic, mathematical distribution, state machines, file operations and datetimes. This documents any assumptions and makes a solid foundation.
Phase 2. Write a simulation for generating randomized inputs to test the whole system. Run it for hours. If I can't generate the inputs, find as big a variety of inputs as possible. Collect any bugs, fix, repeat. This reduces the chances of finding real time bugs by three orders of magnitude.
This has worked really well in the past whether I'm working on games, parsers or financial software. I don't conform to corporate whatever-driven testing patterns because they are usually missing the crucial phase 2 and time phase 1 incorrectly.
The author's problem is pretty simple: the test repo is required for pre-merge tests to pass, but it can be updated independently, without having pre-merge tests pass.
And the answer is pretty simple: pin the specific test repo version! Use lockfiles, or git submodules, or put "cd tests && git checkout 3e524575cc61" in your CI config file _and keep it in the same repo as source code_ (that part is very important!).
This solves all of the author's problems:
> new test case is added to the conformance test suite, but that test happens to fail. Suddenly nobody can submit any changes anymore.
Conformance test suite is pinned, so new test is not used. A separate PR has to update conformance test suite version/revision, and it must go through regular driver PR process and therefore must pass. Practically, this is a PR with 2 changes: update pin and disable new test.
> are you going to remember to update that exclusion list?
That's why you use an "expect fail" list (not an exclusion list) and keep it in the driver's dir. As you submit your PR you might see a failure saying: "congrats, test X which was expect-fail is now passing! Please remove it from the list". You'll need to make one more PR revision but then you get working tests.
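The checking logic for such a list is small. A sketch in Python, where the function name, the message wording and the result format (test name mapped to pass/fail) are assumptions for illustration:

```python
from typing import Dict, List, Set


def check_results(results: Dict[str, bool], expected_failures: Set[str]) -> List[str]:
    """Compare actual results against the expect-fail list.

    Returns a list of problems; an empty list means CI can go green.
    """
    problems = []
    for name, passed in sorted(results.items()):
        if passed and name in expected_failures:
            problems.append(
                f"{name}: expected to fail but passed; remove it from the expect-fail list"
            )
        elif not passed and name not in expected_failures:
            problems.append(f"{name}: unexpected failure")
    # Unlike an exclusion list, stale entries get noticed too.
    for name in sorted(expected_failures - set(results)):
        problems.append(f"{name}: listed as expect-fail but was not run")
    return problems
```

Because expected failures are still executed, the list cannot quietly rot the way an exclusion list does.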
> allowing tests to be marked as "expected to fail". But they typically also assume that the TB can be changed in lockstep with the SUT and fall on their face when that isn't the case.
And if your TB cannot be changed in lockstep with SUT, you are going to have a truly miserable time. You cannot even reproduce the problems of the past!
So make sure your kernel is known or at least recorded, repos are pinned. Ideally the whole machine image, with packages and all is archived somehow -- maybe via docker or raw disk image or some sort of ostree system.
> Problem #2 is that good test coverage means that tests take a very long time to run.
The described system sounds very nice, and I would love to have something like this. I suspect it will be non-trivial to get working, however. But meanwhile, there is a manual solution: have more than one test suite. "Pre-merge" tests run before each merge and contain a small subset of testing. A bigger suite, run "continuously" (if you use physical machines) or "every X hours" (if you use some sort of auto-scaling cloud), will run a bigger set of tests, and can be triggered manually on PRs if a developer suspects the PR is especially risky.
You can even have multiple levels (pre-merge, once per hour, 4 times per day) but this is often more trouble than it's worth.
And of course it is absolutely critical to have reproducible tests first -- if you come up to work and find a bunch of continuous failures, you want to be able to re-run with extra debugging or bisect what happened.
> And the answer is pretty simple: pin the specific test repo version! Use lockfiles, or git submodules, or put "cd tests && git checkout 3e524575cc61" in your CI config file _and keep it in the same repo as source code_ (that part is very important!).
Indeed. Where I work we have a bunch of repos, but they always reference each other via pinned commits. We happen to use Nix, with its built in 'fetchGit' function; it's also easy to override any of these dependencies with a different revision. For example:
This is a function taking two arguments ('helpers' and 'some-library'), with default arguments that fetch particular git commits. This gives us the option of calling the function with different values, to e.g. build against different commits.
We run our CI on GitHub Actions, which allows some jobs to be marked as 'required' for PRs (using branch protection rules). The normal build/test jobs use the default arguments, and are marked as required: everything is pinned, so there should be no unexpected breakages.
Some of our libraries also define extra CI jobs, which are not marked as required. Those fetch the latest revision of various downstream projects which are known to use that library, and override the relevant argument with themselves. For example, the 'some-library' repo might have a test like this:
import (fetchGit {
  url = "git://url-of-some-library.git";
  ref = "master";
  # No 'rev' given, so it will fetch 'HEAD'
}) {
  # Build with this checkout of some-library, instead of the pinned version
  some-library = import ./. {};
}
This lets us know if our PR would break downstream projects, if they were to subsequently update their pinned dependencies (either because we've broken the library, or the downstream project is buggy). It's useful for spotting problems early, regardless of whether the root cause is upstream or downstream.
Yeah - developers need to control their own tests. If in the weird case they don't control their tests (conformance tests) - you need to control when those tests are added.
Some good ideas here for when your tests are in a separate repo than the system under test (GPUs/drivers/compilers in the case of the author, but it's applicable to a variety of industries).
Tests in a separate repo are the worst anti-pattern I have seen. It’s extremely common that a change requires a change in tests, but it’s impossible to correctly manage this situation if the tests can’t be updated in the same commit/PR.
The only time "tests in a separate repo" makes sense to me is if they are truly cross-functional end to end tests that exercise several systems.
Those tests should be as small as possible to verify that everything is still wired together correctly.
Everything else should be either unit tests or narrow integration tests between a small handful of components. And as you said, they should live in the repository of the software they test.
I can't think of any project I've worked on where external test suites even make sense. I suppose it would work when you have a very clear spec or compliance document you could write independent tests, or if you're rewriting a system and need the public API to be bug-for-bug compatible with the old one, but other than those niche use cases I wouldn't want to keep those tests external at all.
Even if you do have external tests, you still need internal ones for the surface area your external tests don't check for. Unit tests and such don't make sense at all combined with a separate test repo.
Think systems integrators and compliance tests. I would imagine that each of the individual systems being "integrated" do have their own unit tests, upstream, in their own repos.
In that case you have to release versions with compatibility for both the new and old way. At no point can I ever see it being a good idea to just let tests fail.
Have I misunderstood the article, or is it just a matter of separating feature branches and putting relevant tests in a feature branch while keeping regression tests in a master branch?