The Hidden Costs of Flaky Tests: A Deep Dive into Test Reliability

[article]
Summary:

Flaky tests present a significant challenge in software development, leading to frustration and inefficiency. These automated tests, which exhibit inconsistent pass/fail behavior without codebase changes, represent a substantial drain on resources and compromise the reliability of testing processes. This article explores the often-overlooked costs associated with flaky tests, examines their underlying causes, and discusses effective strategies for mitigation.

As a software engineer with over a decade of experience in test automation, I've witnessed firsthand the frustration and inefficiency caused by flaky tests. These temperamental tests, which pass and fail inconsistently without any changes to the codebase, are more than just a nuisance—they're a significant drain on resources and a threat to the integrity of our testing processes.

In this article, we'll explore the true impact of flaky tests on software development, examine their root causes, and discuss strategies to mitigate their effects. By the end, you'll understand why addressing test flakiness should be a priority for any development team striving for efficiency and reliability.

The Real Cost of Flaky Tests

An occasional test failure might seem like a minor inconvenience. However, the cumulative effect of flaky tests can be staggering:

Wasted Developer Time

According to a study by Google, flaky tests account for 4.56% of test failures, costing the company over 2% of coding time [1]. For a team of 50 developers, this translates to losing an entire person-year annually just to deal with unreliable tests.

To put that in concrete terms, let's break down the numbers:

  • Assuming an average yearly salary of $120,000 for a software developer
  • 2% of coding time for 50 developers equates to 1 full-time equivalent (FTE)
  • The direct cost in lost productivity: $120,000 per year

But the actual cost goes beyond just the salary. Consider the opportunity cost of features not developed, bugs not fixed, and improvements not made due to this lost time.

Delayed Releases

In a continuous integration/continuous deployment (CI/CD) environment, flaky tests can trigger false alarms that halt the pipeline. A survey by GitLab found that 36% of developers experience delayed releases due to test failures at least once a month [2].

Let's examine the potential impact:

  • For a company releasing bi-weekly, this could mean 1-2 delayed releases per quarter.
  • Each delay might push back the release by a day or more.
  • For a product generating $10 million in annual revenue, a one-day delay could cost over $27,000 in lost revenue.
  • This doesn't account for the cost of emergency meetings, additional QA efforts, and potential loss of customer trust.

Erosion of Trust

When tests become unreliable, developers start to ignore their results. This undermines the entire purpose of automated testing and can lead to real bugs slipping through unnoticed.

The erosion of trust can manifest in several ways:

  • "Retry culture": Developers automatically rerun failed tests, assuming they're flaky.
  • Ignoring test results: Teams may start to deploy despite test failures, defeating the purpose of the test suite.
  • Decreased investment in testing: If tests aren't trusted, less effort goes into maintaining and improving them.

A case study from Mozilla found that after addressing their flaky tests, developers' confidence in the test suite increased by 29%, leading to faster issue resolution and fewer escaped bugs.

Increased Cognitive Load

Debugging flaky tests often requires context-switching and deep investigation, disrupting developer flow and productivity.

Consider the following scenario:

  1. A developer is deep in feature development.
  2. A flaky test fails in CI, blocking the merge.
  3. The developer must switch context, dive into test logs and potentially unfamiliar parts of the codebase.
  4. After hours of investigation, they may find that the test simply passes on rerun, with no clear cause identified.

This context-switching doesn't just waste time—it disrupts the deep focus needed for complex problem-solving, potentially impacting the quality of the primary task the developer was working on.

Root Causes of Test Flakiness

To effectively combat flaky tests, we need to understand their origins. Common causes include:

Asynchronous Wait Issues: Tests that don't properly account for asynchronous operations can fail intermittently when timing varies. This is particularly common in frontend testing and when dealing with external services.

Example scenario:

// Flaky test
test('user profile loads', async () => {
  const profilePage = await navigateToProfile();
  const username = profilePage.getUsernameElement();
  // May run before the username has rendered, so it fails intermittently.
  expect(username.text()).toBe('JohnDoe');
});

// More reliable version
test('user profile loads', async () => {
  const profilePage = await navigateToProfile();
  // Explicitly wait for the asynchronous render to finish before asserting.
  await profilePage.waitForUsername();
  const username = profilePage.getUsernameElement();
  expect(username.text()).toBe('JohnDoe');
});


Resource Leaks: Tests that don't clean up resources properly can interfere with subsequent tests, leading to cascades of failures that are hard to diagnose. Common culprits include (a cleanup sketch follows this list):

  • Unclosed database connections
  • Temporary files that are not deleted
  • Global state modifications that are not reset
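
The discipline that prevents these leaks is pairing every resource with an explicit teardown. Below is a minimal pytest-style sketch; the temporary file, feature flag, and fixture names are hypothetical stand-ins for whatever your tests actually touch.

import os
import tempfile

import pytest

# Hypothetical module-level setting standing in for global state a test might modify.
FEATURE_FLAGS = {"new_checkout": False}


@pytest.fixture
def temp_config_file():
    # Create the temporary file the test needs...
    fd, path = tempfile.mkstemp(suffix=".cfg")
    os.close(fd)
    yield path
    # ...and always delete it afterwards, even if the test fails.
    os.remove(path)


@pytest.fixture
def checkout_flag_enabled():
    # Flip the global flag for this test only and restore it on teardown.
    original = FEATURE_FLAGS["new_checkout"]
    FEATURE_FLAGS["new_checkout"] = True
    yield
    FEATURE_FLAGS["new_checkout"] = original


def test_checkout_uses_config(temp_config_file, checkout_flag_enabled):
    # Fresh file and flag state for this test; neither leaks into the next one.
    assert os.path.exists(temp_config_file)
    assert FEATURE_FLAGS["new_checkout"] is True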

External Dependencies: Reliance on external services or databases can introduce inconsistencies. These dependencies may have their own reliability issues or may be rate-limited, causing sporadic test failures.

Concurrency Problems: Race conditions in multithreaded applications can cause sporadic failures. These are often some of the hardest flaky tests to diagnose and fix.
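
As a hedged illustration (the worker and counter below are hypothetical), the following test is flaky because the assertion races a background thread; joining the thread first makes it deterministic.

import threading


def test_background_counter():
    results = {"count": 0}

    def worker():
        # Hypothetical background work that updates shared state.
        for _ in range(1000):
            results["count"] += 1

    thread = threading.Thread(target=worker)
    thread.start()

    # Flaky: asserting here races the worker and fails whenever the assertion
    # runs before the loop finishes.
    # assert results["count"] == 1000

    # Reliable: synchronize with the thread before asserting.
    thread.join()
    assert results["count"] == 1000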

Environmental Inconsistencies: Differences between development, test, and production environments can lead to flakiness. This could be due to different OS versions, library versions, or configuration settings.
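
One lightweight mitigation is to make the environment visible in every run so failures can be correlated with where they happened. A minimal sketch, assuming a pytest-based suite and nothing beyond the standard library:

# conftest.py
import platform
import sys


def pytest_report_header(config):
    # Printed at the top of every test run; lets you correlate flaky failures
    # with the OS and interpreter they ran on.
    return [
        f"os: {platform.platform()}",
        f"python: {sys.version.split()[0]}",
    ]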

Strategies for Combating Flaky Tests

Now that we understand the impact and causes of flaky tests, let's explore strategies to address them:

1. Implement Proper Waiting Mechanisms

Replace arbitrary sleep statements with intelligent waits. Utilities like Selenium's WebDriverWait, or the asynchronous waiting helpers built into most JavaScript testing frameworks, can help manage asynchronous operations more reliably.


import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Instead of this:
time.sleep(5)
element.click()

# Use this:
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.ID, "myElement")))
element.click()


Additionally, consider implementing custom wait conditions for complex scenarios:

def wait_for_data_load(driver):
    # Custom condition: at least one data row has rendered.
    return len(driver.find_elements_by_class_name("data-row")) > 0

WebDriverWait(driver, 10).until(wait_for_data_load)


2. Isolate Tests

Ensure each test runs in isolation, with its own setup and teardown procedures. This prevents tests from interfering with each other and makes it easier to find the source of flakiness.


def setUp(self):
    # Give every test its own database and application instance.
    self.db = create_test_database()
    self.app = create_test_app(self.db)

def tearDown(self):
    # Release everything so no state leaks into the next test.
    self.db.clear()
    self.app.shutdown()

Consider using containerization to provide a fresh environment for each test:

# Docker Compose file for test isolation
version: '3'

services:
  test:
    build: .
    command: python -m unittest discover tests
    environment:
      - DATABASE_URL=postgres://testuser:testpass@db:5432/testdb
  db:
    image: postgres:13
    environment:
      - POSTGRES_USER=testuser
      - POSTGRES_PASSWORD=testpass
      - POSTGRES_DB=testdb

3. Mock External Dependencies

Use mocking frameworks to simulate external services and databases. This reduces reliance on potentially unstable external factors.


from unittest.mock import patch

@patch('myapp.external_service.api_call')
def test_feature(mock_api_call):
    # Replace the real external call with a canned success response.
    mock_api_call.return_value = {'status': 'success'}
    result = my_feature()
    assert result == 'expected output'


For more complex scenarios, consider using tools like WireMock or MockServer to simulate entire API endpoints.
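
WireMock and MockServer run as standalone stub servers. As a rough in-process analog for a Python suite, the sketch below uses the responses library to register a canned reply for a hypothetical endpoint, so the test never touches the real network:

import requests
import responses


@responses.activate
def test_user_lookup_handles_api_response():
    # Register a canned response for the (hypothetical) endpoint under test.
    responses.add(
        responses.GET,
        "https://api.example.com/users/42",
        json={"id": 42, "name": "JohnDoe"},
        status=200,
    )

    # The stub answers instead of the real service.
    reply = requests.get("https://api.example.com/users/42")
    assert reply.json()["name"] == "JohnDoe"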

4. Implement Retry Mechanisms

For tests that are inherently prone to occasional failures due to factors beyond your control, implement a retry mechanism. However, use this sparingly and as a last resort, as it can mask underlying issues.

from retrying import retry  # the retrying library provides this decorator signature

@retry(stop_max_attempt_number=3, wait_fixed=1000)
def test_with_retry():
    # Rerun up to 3 times, waiting 1 second (1000 ms) between attempts.
    # Test logic here
    pass


5. Continuous Monitoring and Analysis

Implement tools to track test stability over time. Many CI/CD platforms offer built-in flaky test detection. For example, Jenkins has the "Test Stability History" plugin, which provides insights into test reliability [3].

To go beyond built-in tools, consider implementing a custom flaky test detection system (a sketch of the pattern-detection step follows the list):

  1. Store test results in a database, including metadata like execution time, environment details, and stack traces for failures.
  2. Implement algorithms to detect patterns in test failures.
  3. Generate regular reports on test stability and trends.
  4. Automatically quarantine tests that exceed a flakiness threshold for further investigation.
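
A minimal sketch of step 2, assuming test results have already been collected as chronological (test name, passed) pairs: the heuristic below scores each test by how often its outcome flips between consecutive runs, since a genuine regression flips once and stays failing while a flaky test flips repeatedly.

from collections import defaultdict


def flakiness_scores(results):
    # results: iterable of (test_name, passed) tuples in chronological order.
    history = defaultdict(list)
    for name, passed in results:
        history[name].append(passed)

    scores = {}
    for name, outcomes in history.items():
        if len(outcomes) < 2:
            scores[name] = 0.0
            continue
        # Count pass/fail transitions between consecutive runs.
        flips = sum(1 for a, b in zip(outcomes, outcomes[1:]) if a != b)
        scores[name] = flips / (len(outcomes) - 1)
    return scores


# Example: test_login flips twice in four runs, test_checkout never flips.
runs = [
    ("test_login", True), ("test_login", False),
    ("test_login", True), ("test_login", True),
    ("test_checkout", True), ("test_checkout", True),
]
print(flakiness_scores(runs))  # test_login ≈ 0.67, test_checkout 0.0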

6. Prioritize Fixing Flaky Tests

Treat flaky tests as technical debt. Allocate dedicated time in each sprint to investigate and fix unreliable tests. Google's Engineering Productivity Research team found that addressing flaky tests early can significantly reduce their long-term impact [4].

Implement a "flaky test budget" for your team:

  1. Set a target for the maximum acceptable number of flaky tests (e.g., <1% of total tests); this budget can be checked mechanically, as in the sketch after this list.
  2. Regularly review and prioritize flaky tests for fixing.
  3. Assign story points or time estimates to flaky test fixes, treating them as first-class development tasks.
  4. Consider implementing a "fix or delete" policy for long-standing flaky tests.
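
The first item can be enforced in CI. A minimal sketch, reusing the flakiness scores from the detection sketch above; the 1% budget and 0.3 flip-rate threshold are illustrative assumptions, not recommendations:

def enforce_flaky_budget(scores, budget=0.01, flaky_threshold=0.3):
    # scores: the {test_name: flip_rate} mapping produced by flakiness_scores().
    flaky = sorted(name for name, score in scores.items() if score >= flaky_threshold)
    flaky_fraction = len(flaky) / max(len(scores), 1)
    if flaky_fraction > budget:
        # Failing the build keeps the budget visible instead of silently eroding.
        raise SystemExit(
            f"Flaky-test budget exceeded: {flaky_fraction:.1%} flaky "
            f"(budget {budget:.0%}). Offenders: {flaky}"
        )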

Case Study: Microsoft's Battle with Flaky Tests

Microsoft's journey in tackling flaky tests offers valuable insights. In a presentation at the 2020 International Conference on Software Engineering, Microsoft researchers shared their experiences [5]:

  1. They developed a tool called "Deflaker" to automatically identify and quarantine flaky tests.
  2. By implementing a company-wide policy to fix or remove flaky tests within two weeks, they reduced overall test flakiness by 18% in six months.
  3. The initiative resulted in a 2.5% increase in developer productivity, equating to millions of dollars in saved engineering time.
  4. Microsoft implemented a "flaky test score" for each project, which became part of their engineering health metrics.
  5. The company developed best practices and training materials for writing stable tests, which became part of their onboarding process for new developers.

Conclusion

Flaky tests are more than just a minor annoyance—they are a significant drain on resources and a threat to the reliability of our software development processes. By understanding their causes and implementing targeted strategies to address them, we can dramatically improve the efficiency and effectiveness of our testing efforts.

Remember, the goal isn't just to have tests that pass, but to have tests we can trust. Investing time and resources in combating test flakiness pays dividends in the form of faster releases, increased developer productivity, and higher quality software.

As an industry, we must prioritize test reliability with the same vigor we apply to feature development. Only then can we fully realize the benefits of automated testing and continuous integration.

To take your test reliability efforts to the next level:

  1. Conduct a flaky test audit of your current test suite.
  2. Implement a flaky test detection and monitoring system.
  3. Establish clear policies and procedures for addressing flaky tests.
  4. Invest in developer education on writing stable tests.
  5. Regularly review and refine your testing strategies based on collected data.

By making test reliability a core part of your development culture, you'll not only improve your testing process but also enhance overall software quality and team productivity.

References

[1] Q. Luo, F. Hariri, L. Eloussi, and D. Marinov, "An empirical analysis of flaky tests," in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2014.
[2] GitLab, "2020 Global DevSecOps Survey," GitLab, 2020.
[3] Jenkins, "Test Stability History Plugin," Jenkins Plugin Index, 2021.
[4] J. Micco, "Flaky Tests at Google and How We Mitigate Them," Google Testing Blog, 2016.
[5] A. Shi, J. Bell, and D. Marinov, "Mitigating the effects of flaky tests on mutation testing," in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2020.

User Comments

2 comments
Zafira Lann

Really relatable read. Flaky tests have caused a lot of trouble for our team too. The worst part is how they make people lose trust in automated tests, even when most are fine.

We've learned how important it is to hire talented QA professionals who know how to deal with these issues. It's not just about the tools, it's about the right mindset.

May 9, 2025 - 6:23am
Greg Smith

I have 25 years of manual testing experience and have on many occasions had to "prove" that the automated test results were incorrect or correct. I usually have to help test the automated tests to prove that they are not flaky or reporting false results. There can be a subtle change someplace in the codebase that causes a break in a test that a developer would not expect to be impacted. Developers and testers have different mindsets on how to approach testing, and both are needed to get the job done completely. Testing code with code, to me, is a step in the process but not the final step in testing.

May 14, 2025 - 11:41am
