Start Trusting Your Test Automation Again

The more you rely on feedback from your automated tests, the more you need to be able to rely on the quality and defect-detection power of these tests. Unfortunately, instead of being the stable and reliable guardians of application quality they should be, automated tests are regularly a source of deceit, frustration, and confusion. Here's how you can start trusting your automated tests again.

Development teams these days often rely on feedback from automated tests to determine the quality of the application they are working on. Running automated tests every night or after every commit, depending on the type of test and the adopted development process, gives teams information about the impact that changes made to the code have had on its overall workings.

However, with great power comes great responsibility. The more you rely on the feedback from your automated tests to decide whether to let a build pass to the next step in your build pipeline (and this next step might just be a deploy into production!), the more you need to be able to rely on the quality and defect-detection power of these very automated tests.

In other words, trust in the quality of an application is required, so if the application is tested in an automated fashion at any point, you must also trust in the quality of these automated tests.

Unfortunately, this is where test automation all too often fails to live up to its promise. Instead of being the stable and reliable guardians of application quality they should be, automated tests are regularly a source of deceit, frustration, and confusion. They end up undermining the very trust they are meant to provide in the first place.

How can we start trusting our automated tests again? Let's take a look at two ways automated tests can harm trust instead of creating it, then explore what you can do to repair the situation—or, even better, prevent it from happening in the first place.

False Positives

Tests that fail for reasons other than a defect in your application under test—or at least a mismatch between an expected and an actual test result—are known as false positives.

This type of harmful automated test occurs most often with user interface-driven tests, simply because these tests offer the most ways for things to go wrong (synchronization and timeout issues, insufficient exception handling in the automated test solution, etc.). If your timeout and exception handling aren't implemented properly, false positives can be very time-consuming, both in root cause analysis and in preventing them from happening again. When these false positives occur intermittently (such tests are also known as “flaky tests”), it can be even harder to find out what is causing the trouble.

False positives are especially frustrating when your team or organization is adopting a continuous integration or continuous delivery approach to software development. When your build pipeline contains tests that intermittently cause false positives, your build might break every now and then simply because your tests aren't robust enough to handle exceptions and timeouts in a proper manner.

This might eventually cause you to just remove the tests from your pipeline altogether. Although this could solve your broken builds problem at that moment, it is not a strategy that is sustainable in the long term. There is a reason you're putting effort into creating these tests (right?), so they should be part of the automated testing and delivery process. Therefore, you should put effort into investigating the root cause of these false positives as soon as they occur and repair them right away.

Even better, you should put effort into creating robust and stable tests, including proper exception handling and synchronization strategies, to prevent these false positives from occurring in the first place. This will undoubtedly take time, effort, and craftsmanship up front, but a solid foundation will pay off in the long term by resulting in the eventual absence of these pesky false positives.
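As a minimal sketch of one such synchronization strategy in Python (the names `wait_until` and `ConditionTimeout` are illustrative, not from any particular framework): a polling wait that retries a condition and fails with a descriptive message instead of a generic error, so a timeout tells you why the test gave up.

```python
import time


class ConditionTimeout(Exception):
    """Raised when a condition never becomes true within the timeout."""


def wait_until(condition, timeout=10.0, interval=0.5, description="condition"):
    """Poll `condition` until it returns a truthy value or `timeout` expires.

    Swallows intermediate exceptions (e.g. an element that is not rendered
    yet) and reports the last one in the timeout message, so a failing test
    explains itself instead of just breaking the build.
    """
    deadline = time.monotonic() + timeout
    last_error = None
    while time.monotonic() < deadline:
        try:
            result = condition()
            if result:
                return result
        except Exception as exc:
            last_error = exc  # remember why the condition isn't met yet
        time.sleep(interval)
    message = f"Timed out after {timeout}s waiting for {description}"
    if last_error is not None:
        message += f" (last error: {last_error!r})"
    raise ConditionTimeout(message)
```

A UI test would then wait with something like `wait_until(lambda: save_button.is_visible(), description="Save button")` (where `save_button` is a hypothetical page object) rather than a fixed sleep, which is a common source of flakiness.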

False Negatives

While false positives can be very frustrating, at least they make their presence known by means of error messages or broken CI builds. The real risk of decaying trust in test automation lies in false negatives: tests that pass when they should not.

The more you rely on the results of your automated test runs when making procedural decisions, such as a deployment into production, the more important it is that those results accurately represent the actual quality of your application under test, rather than coming from an empty shell of tests that pass yet do not perform the checks they are supposed to.

False negatives are especially tricky to detect and deal with for two distinct reasons:

  • As noted above, they do not actively announce their presence by throwing an error message, like true positives and false positives do
  • While some tests might be false negatives right from the start, many false negatives are introduced over time, after several changes to the application under test

A vital part of your test automation strategy, next to creating new tests and updating outdated ones, should therefore be to regularly check up on the defect detection power of your existing automated tests. This applies especially to tests that have been running smoothly and passing since their creation. Are you sure they really still execute the right assertions? Or would they also pass when the application under test behaved erroneously?

I like to call this regular checkup on your existing test suite “keeping your tests fresh.” Here’s a solid strategy for periodically making sure your tests are still worthy of the trust you put in them:

  1. Test your tests when you create them. Depending on the type of test and assertion, this can be as simple as negating the assertion made at the end of the test and seeing whether the test fails. When you're adopting test-driven development, you're automatically doing something similar, because your tests do not pass until you implement working production code.
  2. Periodically, go through your tests and make sure they a) are not made redundant by the changes that have been made to your application since the test was created (a passing test that's irrelevant might be nice for the statistics, but imposes a maintenance burden you likely can do without), and b) still possess their original defect detection power.
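Step 1 can be sketched in a few lines of Python (a toy example; `total_price` and the tests are made up for illustration): a deliberately wrong expected value confirms that the assertion can actually fail.

```python
def total_price(items):
    """Production code under test."""
    return sum(items)


def test_total_price():
    assert total_price([10, 20]) == 30


def test_total_price_can_fail():
    # "Test the test": a deliberately wrong expected value must make the
    # assertion fail. If it doesn't, the test is an empty shell that would
    # also pass when the application misbehaves.
    try:
        assert total_price([10, 20]) == 31
    except AssertionError:
        pass  # good: the check has real defect-detection power
    else:
        raise RuntimeError("assertion never fails; the test cannot detect defects")


test_total_price()
test_total_price_can_fail()
```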

For unit tests, I recommend at least looking into mutation testing as a tool for creating and maintaining a high-quality, powerful test suite. Mutation testing creates mutant versions of your code (for example, by flipping a relational operator such as a “greater than” symbol) and subsequently seeing if the test fails. If the test still passes despite the change made by the mutation testing algorithm, it is incapable of detecting these changes and should be dealt with accordingly.

Even though doing a mutation testing run can be a time-consuming process, especially for large applications, it might give you some useful insight into the quality and defect-detection power of your unit test suite.
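To make the idea concrete, here is a hand-rolled illustration in Python (real tools such as PIT for Java or mutmut for Python generate and run the mutants automatically; the functions below are invented for this sketch). The same test runs against the original code and a mutant with a flipped relational operator; a test that skips the boundary value lets the mutant survive.

```python
def discount(amount):
    # Original: orders strictly over 100 get a discount
    return 0.1 if amount > 100 else 0.0


def discount_mutant(amount):
    # Mutant: ">" flipped to ">=", the kind of change a mutation tool makes
    return 0.1 if amount >= 100 else 0.0


def weak_test(fn):
    # Only checks values far from the boundary
    return fn(200) == 0.1 and fn(50) == 0.0


def strong_test(fn):
    # Also checks the boundary value itself
    return fn(200) == 0.1 and fn(50) == 0.0 and fn(100) == 0.0


assert weak_test(discount) and weak_test(discount_mutant)          # mutant survives the weak test
assert strong_test(discount) and not strong_test(discount_mutant)  # strong test kills the mutant
```

A surviving mutant is the signal that a test needs strengthening (or removing), which is what "dealt with accordingly" amounts to in practice.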

Unfortunately, I am not aware of any similar tools for other types of tests, such as integration and end-to-end tests. For these tests, I recommend regularly going through your suite manually to take stock of the quality of your tests and keep them fresh.

Create the Right Tests in the First Place

Building trustworthy automated tests requires development skills, but it may be even more important to have the skill to decide whether a test should be automated in the first place.

It also takes skill to determine the most efficient way to automate a given test. I like to do as my namesake Edsger Dijkstra recommended and look for the most elegant solution to a test automation problem—the simple, clear strategy. If the path I'm taking isn't leading to an elegant solution, then this solution probably is not the most efficient one, either.

This translates to trustworthiness, too: If your solution isn't elegant, there are likely too many ways in which the test results can deceive you. Simplicity and clarity will help make you secure in trusting your test automation.

User Comments

Bas Dijkstra

Thank you for the kind words, Nathaniel!

August 15, 2017 - 3:43am
matus poruban

Very good article. I am very proud that developers trust my collection. But after reading the article, I want to go over it and double-check everything, just in case.

August 22, 2017 - 7:24am
Bas Dijkstra

Excellent! As they say, an ounce of prevention is worth a pound of cure!

August 22, 2017 - 7:38am
hosk doug

Great article. Has anyone successfully integrated mutation testing into their pipeline? I see a fair few articles suggesting to investigate it to assess your test quality, but none describing a successful integration.

August 28, 2017 - 5:39am
Bas Dijkstra

Hey Doug,

Thanks for the kind words! Personally, I do not have a mutation testing success story from practice at hand, nor have I ever read one.

Three possible reasons:

  • Mutation testing isn't as widespread an activity as regular testing yet
  • Mutation testing can become quite a time-intensive task for larger code bases, which undercuts the benefit of fast feedback from regular automated tests
  • I just don't read the right things

Because of reason #2, I think most teams that dabble with mutation testing do it as a regular side activity rather than integrating it in their pipeline. Or it isn't seen as mission critical because mutation testing results do not tell you anything about the quality of your product (at least not directly). Instead, they tell you something about

  1. The quality of your unit tests themselves
  2. Which parts of your code base aren't covered with unit tests

This is just me thinking out loud. Really need to dive into mutation testing some more myself.

August 28, 2017 - 6:23am
hosk doug

mutation testing results do not tell you something about the quality of your product

Ho-ho! This segues into a subject that has been fascinating and frustrating me in equal measure lately. How can you trust your automated tests to tell you (and your stakeholders, who are used to actual people testing the software manually and watching it pass or fail!) that your system is ready to ship, replacing (at least in part) manual regression testing?

Richard Dawkins says at the start of "The God Delusion":

If you'll stoop to magicking into existence an unexplained peacock-designer, you might as well magic an unexplained peacock and cut out the middleman.

By trusting our automated tests, are we not saying something similar?  

Due to human nature, and the nature of software development, we (rightly!) do not simply trust that a software system is ready to release, and that developers have not inadvertently introduced regressions into the system.

To mitigate this, we are placing our trust in suites of automated tests.

But hold on, are these test suites not themselves complex software systems, usually built and maintained by the same team that built the system in question?

We do not simply trust that that commit which was code-reviewed late on Friday afternoon did not break a crucial piece of app functionality.  But we do however trust that the same commit did not subtly cause false-positives in a crucial test!

We all know that test coverage metrics do not provide clear answers to this.  It strikes me that mutation testing could provide at least some of the answers.


August 28, 2017 - 9:30am
Bas Dijkstra

Hey Doug,

You're absolutely right there; that was a rather badly thought-through comment of mine. Of course mutation testing results do tell you something about the quality of your application under test, albeit in a somewhat less direct manner. Also, even if you're using mutation testing, your information would still be limited to unit testing scope. Defects at the integration or end-to-end level would still live on.

It IS an interesting discussion though, trusting your automated tests. Not coincidentally, I'll be giving a talk about this very subject in a couple of months at TestBash in Manchester. Thank you for your comments, they're making me think harder about the subject, and that (hopefully) means an even better talk :)

By the way, where you say

"We do not simply trust that that commit which was code-reviewed late on Friday afternoon did not break a crucial piece of app functionality. But we do however trust that the same commit did not subtly cause false-positives in a crucial test!"

Don't you mean false negatives instead? A false positive lets the alarm bells go off; a false negative is a silent killer.

August 28, 2017 - 9:47am
