Hacker News
by jonthepirate 2619 days ago
Having worked at both Lyft and DoorDash as an engineer responsible for unit test health, I decided to build a side project called Flaptastic (https://www.flaptastic.com/), a flaky unit test resolution system.

Flaptastic makes your CI/CD pipelines reliable by identifying which tests fail due to flaps (aka flakes). It then gives you a "Disable" button that instantly skips any test, effective immediately across all feature branches, pull requests, and deploy pipelines.

An on-premises version is in the works so that enterprises can run it on site.

1 comments

I don't want to come across as negative, so take this as an observation and a bit of devil's advocacy: wouldn't it be better to fix the flaky test or delete it entirely, instead of building a feature to disable it automatically during a test run?

Whenever our team has a significant number of flakey tests (more than 1-2) we usually schedule a bug squash session to fix them and amortize the cost over the whole team.

What you really want to do is first disable a test you know is unhealthy to unblock everybody. Then, you fix it. After you've reintroduced it healthy, you can turn it back on.
I was talking to someone from Google who works on Bazel, and he brought up an interesting point: flaky tests are asymmetric in that they don't provide much value when they fail (since you don't know whether the failure was due to flakiness), but they do provide a lot of value when they pass (because they presumably test something non-trivial).

With this in mind, what Bazel does when a test is marked flaky is run it several times. This is a simple way of minimizing the effect of flakiness while still getting confidence from green tests.

I dislike rerunning flaky tests. It too often masks genuine failures.
If the effort required to mark-disable/comment out/rm a known-unhealthy test is more than a few seconds beyond the effort of navigating through a tool like the one you describe, I think the problem is likely in the change control/source control processes being employed. Disabling a test seems like it should be so easy as to not need an additional tool (unless tests are flaking out so often that even the <1min of overhead to disable them is adding up, in which case I suspect that people are misinterpreting something fundamental about the role of tests in their development processes).
What I've seen in most companies is that when a test goes bad (imagine 10k unit tests, and 1 hits Stripe's API sandbox, which just went down) the bad test affects everybody who's busy working on their respective feature branches. Everybody wonders how their feature branch broke the Stripe integration, and you have hundreds of developers trying to diagnose and fix the same broken test.

Our solution lets someone know that a test failed because it's flaking out as soon as it flakes, and provides a 1-click option to instantly disable that test across all feature branches so that everybody else can continue working undisturbed.
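
For readers wondering how a disable that takes effect on every branch could work without a commit, here is a rough sketch. It is an assumption for illustration, not how Flaptastic is actually implemented: `fetch_disabled_tests` is a hypothetical stand-in for a call to a central service, and the test names are made up. The skip list lives outside the repository, so every branch sees it at run time.

```python
import io
import unittest


def fetch_disabled_tests():
    """Hypothetical stand-in for asking a central service which tests
    are currently disabled. A real client would make a network call."""
    return {"test_charge_card"}


class SkipDisabledMixin:
    """Skips any test whose method name is on the central disabled list."""

    def setUp(self):
        super().setUp()
        method_name = self.id().split(".")[-1]
        if method_name in fetch_disabled_tests():
            self.skipTest("disabled centrally as flaky")


class PaymentTests(SkipDisabledMixin, unittest.TestCase):
    def test_charge_card(self):
        # Simulates a test that depends on an external sandbox being up.
        raise ConnectionError("simulated sandbox outage")

    def test_format_amount(self):
        self.assertEqual(f"{12.5:.2f}", "12.50")


def run_suite():
    """Run PaymentTests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(PaymentTests)
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    return runner.run(suite)
```

Because the list is looked up when the tests run, removing a name from it re-enables the test everywhere without a rebase.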

Without something like this, you have to: 1) Create a new feature branch 2) Comment out the broken test 3) Wait for it to pass CI 4) Gain approvals as needed 5) Merge the PR back to the master line 6) Message everybody to let them know the test was removed and they should rebase

The process above is more or less the industry standard. It means a giant loss in productivity for everybody on your team, and it is especially painful for monolith codebases.

Companies where I've worked easily hemorrhage $1m per year on this problem in terms of developer productivity losses if you consider the number of hours wasted per year.

Your first example is an integration test, not a unit test, which should be changed.

Integration tests are nice, but best if run separately...

Best practice is actually just to disable all tests that are failing. Can't hold up our sprint deadlines!
Failing != flaking. If your tests interact with any level of randomness (seed data, time-based constraints, etc.) you're going to find the occasional test that fails and then passes on the rebuild.
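
As a small illustration of the randomness and time sources mentioned above, here is a sketch of making both deterministic by passing them in. The functions `pick_discount` and `is_happy_hour` are hypothetical examples, not from any real codebase.

```python
import random
from datetime import datetime, timezone


def pick_discount(rng):
    """Hypothetical code under test: takes its random source as an argument."""
    return rng.choice([0, 5, 10])


def is_happy_hour(now):
    """Hypothetical code under test: takes the current time as an argument
    instead of reading the wall clock itself."""
    return 17 <= now.hour < 19


def test_pick_discount_is_deterministic():
    # Two generators with the same seed produce the same choice,
    # so the test cannot pass on one build and fail on the next.
    first = pick_discount(random.Random(42))
    second = pick_discount(random.Random(42))
    assert first == second


def test_happy_hour_with_fixed_clock():
    # A fixed timestamp removes the dependency on when CI happens to run.
    assert is_happy_hour(datetime(2020, 1, 1, 17, 30, tzinfo=timezone.utc))
    assert not is_happy_hour(datetime(2020, 1, 1, 9, 0, tzinfo=timezone.utc))
```

A test that calls `random.choice` or `datetime.now()` directly would behave differently from run to run, which is exactly the kind of flake described here.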

If something is consistently failing I would assume this tool does not disable it.