Making Sense of Root Cause Analysis

Summary:
Applying Root Cause Analysis (RCA) to software problems is fundamentally different from applying it in other engineering disciplines. With software, we are usually analyzing a large number of failures rather than a single major one. In this column, Ed Weller explains how to use RCA to your advantage.

Root Cause Analysis (RCA) is often seen as having short-term impacts, when, in fact, the returns may be long term. Understanding the differences is critical to successful implementation of RCA. The chart below lists some of the differences between RCA in software and RCA in other disciplines. This article will address these differences and how they should shape application of RCA to software.

Items one and four in the chart are related and are sometimes misunderstood by those who initiate RCA as a "solution" to software defect prevention (using the "If we get to the root cause of critical customer problems, we can make them go away" train of thought). This confuses the physical realm with the intellectual realm. Three Mile Island, TWA 800, Challenger, and other major disasters could be analyzed to identify operational or physical failures that, once identified, could be prevented by redesigning a part, a system, or an operational procedure. When people apply these methods to software and expect that analyzing critical problems will prevent future failures, they fail to understand the root causes of software defects.

They falsely assume that a major customer failure must have been caused by a major oversight or a repeatable cause, and that the consequence of a fault will always be proportionate to the initial error. For software this is not true. A single-character coding error led to the loss of Mariner 1, a probe bound for Venus that was destroyed shortly after launch, and another caused the 800 telephone system to crash back in 1994. This means that RCA of serious failures will not consistently prevent other serious failures, because the same simple root cause that produces a trivial failure in one part of the product may produce a serious failure in another. A simple typographical error can have a minor impact in one case and cause a catastrophic failure in another.
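
The point about disproportionate consequences can be made concrete with a small sketch. Everything in it is hypothetical and invented for illustration: the helper `reached`, its two callers, and the assumption that the temperature sensor saturates exactly at the cutoff value. The same one-character slip (`>` typed where `>=` was intended) is a cosmetic annoyance in one caller and a safety failure in the other.

```python
def reached(value, limit):
    """Intended behavior: True once value >= limit."""
    # Seeded defect for illustration: ">" was typed instead of ">=".
    return value > limit


def progress_label(done, total):
    """Low-stakes caller: the label never flips to 'complete' at done == total."""
    return "complete" if reached(done, total) else "in progress"


def heater_must_shut_off(reading_c, cutoff_c=150):
    """High-stakes caller. Assumption: the hypothetical sensor saturates at
    cutoff_c, so a reading can equal the cutoff but never exceed it."""
    return reached(reading_c, cutoff_c)


if __name__ == "__main__":
    # Same defect, two very different consequences.
    print(progress_label(10, 10))      # cosmetic: label is wrong
    print(heater_must_shut_off(150))   # serious: shutdown never triggers
```

Analyzing only the heater failure would point at a one-character slip, which is the same cause behind the harmless label bug. Severity here comes from where the defect lands, not from how large the mistake was.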

| # | Software | Other Engineering Disciplines |
|---|----------|-------------------------------|
| 1 | Many failures of varying consequence | Single events with major and often catastrophic results |
| 2 | Failures caused by "intellectual" shortfalls | Failures caused by physical interactions or mechanical fatigue, operational errors, management failures, or design errors |
| 3 | Many common-cause faults | Typically unique faults |
| 4 | No relation of cause to failure or future failures in many cases | Cause may repeat with similar consequence |
| 5 | Typically low effort per failure | Often significant effort, many times with political overtones |
| 6 | Prevention may be well into the future; requires investment to prevent future errors | Scapegoat and financial responsibility |
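
Items one, three, and four suggest looking across many defects rather than dissecting one. The following is a minimal sketch of that idea, not a method the column prescribes: the defect records, the cause categories, and the function names are all made up for illustration. It counts defects per root-cause category and shows which severities each cause has produced.

```python
from collections import Counter, defaultdict

# Hypothetical defect records: (defect id, root-cause category, severity).
DEFECTS = [
    ("D-1", "typo", "minor"),
    ("D-2", "typo", "critical"),
    ("D-3", "misread requirement", "minor"),
    ("D-4", "typo", "minor"),
    ("D-5", "misread requirement", "major"),
    ("D-6", "interface misuse", "minor"),
]


def count_by_cause(defects):
    """Number of defects attributed to each root-cause category."""
    return Counter(cause for _id, cause, _severity in defects)


def severities_by_cause(defects):
    """Set of severities observed for each root-cause category."""
    spread = defaultdict(set)
    for _id, cause, severity in defects:
        spread[cause].add(severity)
    return dict(spread)


if __name__ == "__main__":
    # Most frequent causes first, with the severities each has produced.
    spread = severities_by_cause(DEFECTS)
    for cause, count in count_by_cause(DEFECTS).most_common():
        print(cause, count, sorted(spread[cause]))
```

In this invented data set, the most common cause ("typo") accounts for both minor and critical failures, which is the pattern items three and four describe: a common cause whose consequences vary widely.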

About the author

Ed Weller

Ed Weller is an SEI certified High Maturity Appraiser for CMMI® appraisals, with nearly forty years of experience in hardware and software engineering. Ed is the principal of Integrated Productivity Solutions, a consulting firm that is focused on providing solutions to companies seeking to improve their development productivity. Ed is a regular columnist on StickyMinds.com and can be contacted at edwardfwelleriii@msn.com.
