


A StickyMinds.com Original
Making Sense of Root Cause Analysis

By Ed Weller


Summary: Applying Root Cause Analysis (RCA) to software problems is fundamentally different from applying it to other engineering disciplines. Rather than analyzing a single major failure, we are usually analyzing a large number of failures with software. In this week's column, Ed Weller explains how to use RCA to your advantage.


RCA is often seen as having short-term impacts, when, in fact, the returns may be long term. Understanding the differences is critical to successful implementation of RCA.

The chart below lists some of the differences between RCA in software and RCA in other disciplines. This article will address these differences and how they should shape application of RCA to software.

Items one and four are related and are sometimes misunderstood by those who initiate RCA as a "solution" to software defect prevention (using the "If we get to the root cause of critical customer problems, we can make them go away" train of thought). This confuses the physical realm with the intellectual realm. Three Mile Island, TWA 800, Challenger, and other major disasters could be analyzed to identify operational or physical failures that, once identified, could be prevented by a redesign of a part, system, or operational procedures. When people apply these methods to the software profession and expect analysis of critical problems to prevent future failures, they fail to understand the root causes of software defects.

They falsely assume that something that causes a major customer failure must somehow have been caused by a major oversight or repeatable cause, where the consequence of the fault will always be proportionate or related to the initial error. For software this is not true. Single-character coding errors caused Mariner 1 to be destroyed shortly after launch on its way to Venus and the 800 telephone system to crash back in 1994. This means that RCA of serious failures will not consistently prevent other serious failures, because the kind of root cause behind a simple failure may generate a serious failure in another part of the product. A simple typographical error can have a minor impact in one case and cause a catastrophic failure in another.
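A small illustration of this point, as a sketch in Python. The functions and the sensor are hypothetical and are not taken from the incidents mentioned above. The same one-character slip (`>` typed where `>=` was intended) is cosmetic in one function and serious in the other.

```python
SENSOR_MAX = 100  # hypothetical sensor that saturates at this reading


def is_new_high_score(score: int, best: int) -> bool:
    # Intended: score >= best, so that a tie is also highlighted.
    return score > best  # one character dropped: a tie is not highlighted


def overheat_alarm(reading: int, limit: int = SENSOR_MAX) -> bool:
    # Intended: clamped >= limit.
    clamped = min(reading, SENSOR_MAX)  # the sensor cannot report more than SENSOR_MAX
    return clamped > limit  # same dropped character: with the default limit the alarm never fires


print(is_new_high_score(50, 50))  # False: a minor display defect
print(overheat_alarm(250))        # False: the alarm stays silent however hot it gets
```

Analyzing only the second failure would point to the same kind of cause as the first, which is why severity alone is a poor guide to where the next serious failure will come from.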

Item | Software | Other Engineering Discipline
-----|----------|-----------------------------
1 | Many failures of varying consequence | Single events with major and often catastrophic results
2 | Failures caused by “intellectual” shortfalls | Failures caused by physical interactions or mechanical fatigue, operational errors, management failures, or design errors
3 | Many common-cause faults | Typically unique faults
4 | No relation of cause to failure or future failures in many cases | Cause may repeat with similar consequence
5 | Typically low effort per failure | Often significant effort, many times with political overtones
6 | Prevention may be well into the future; requires investment to prevent future errors | Scapegoat and financial responsibility


If we are lucky enough to identify a common process failure related to a specific failure mode, then RCA will have a benefit. This leads us to identifying a common cause for multiple failures, which is the third item in the table. By systematically analyzing multiple failures, patterns of common cause may be identified, leading to a single fix in a requirements, design, or coding process that eliminates multiple faults with one change. A secondary consequence of this item is that RCA of single failures is self-defeating, as patterns will not be apparent until multiple failures are analyzed and common causes identified. If you go back to one of the original papers on Defect Prevention (search for "Defect Prevention"), you'll find that the RCA process involves collecting data from multiple failures and analyzing them as a group.
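The group analysis described above can be sketched in Python. The failure records, the cause labels, and the `common_causes` helper are all hypothetical and made up for illustration. The point is only that a pattern becomes visible once many failures are tallied together.

```python
from collections import Counter

# Hypothetical failure reports: (failure id, root cause assigned during analysis)
failures = [
    ("F-101", "requirement changed without notification"),
    ("F-102", "interrupted mid-task"),
    ("F-103", "requirement changed without notification"),
    ("F-104", "language feature misunderstood"),
    ("F-105", "requirement changed without notification"),
    ("F-106", "interrupted mid-task"),
]


def common_causes(records, min_count=2):
    """Return (cause, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter(cause for _, cause in records)
    return [(cause, n) for cause, n in counts.most_common() if n >= min_count]


for cause, n in common_causes(failures):
    print(f"{n} failures: {cause}")
```

Looking at any one of these records in isolation would not show that half of them share a single process cause.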

The second item, intellectual vs. physical, is one of the reasons the first and fourth items present their difficulties. Metal fatigue, for example, can be attributed to specific causes that, once eliminated, ensure these failures will not be repeated. The human mind, however, is not so accommodating. If we look at some of the reasons errors get into software, such as communications loss, a noisy work environment, the impact of multi-tasking on short-term memory, etc., we need to address the sociological aspects of our profession rather than mechanical or chemical aspects. How many of us work an entire eight-hour day without interruptions? How often do you start a one-hour task in the morning and find that at day’s end you have not finished it--and the next day, as you start over, you've forgotten a critical aspect of the program that was your next task the day before?

Item six has multiple consequences. In software development, finding the real root cause usually requires the person who made the mistake to be involved in the analysis. If there is fear of retribution (scapegoating), the incentive to identify the root cause is eliminated. The second issue is the time relationship between discovery of the root cause and the chance to prevent the problem in the next development cycle.

I came across an organization that was doing RCA on production failures with the expectation of significantly improving quality. Their typical release schedules were twelve to eighteen months. This meant problems found in the requirements or early design activities and eliminated from the next release cycle would have a twelve- to eighteen-month delay before showing up as improved quality in the next release--not what they wanted. To be effective, the time delay between the error introduction, discovery, root cause analysis, repeat of the activity that introduces the error, and impact on the next development or production cycle should be as short as possible.
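To make the delay concrete, here is a rough sketch in Python. The model is an assumption for illustration and not something the organization used: releases ship every `cycle_months`, the activity that introduced the error runs `phase_offset` months into each cycle, and RCA finishes `rca_done_at` months into the current cycle.

```python
def months_until_visible(cycle_months: float, phase_offset: float, rca_done_at: float) -> float:
    """Months from completing RCA until a release ships with the fix applied.

    Assumed model: the fix only helps once the error-introducing activity
    runs again, and the improvement is only visible when that cycle ships.
    """
    if rca_done_at <= phase_offset:
        # The activity has not run yet this cycle, so this release benefits.
        return cycle_months - rca_done_at
    # Otherwise the fix waits for the activity to repeat in the next cycle.
    return 2 * cycle_months - rca_done_at


print(months_until_visible(18, phase_offset=0, rca_done_at=0))       # 18
print(months_until_visible(18, phase_offset=1, rca_done_at=3))       # 33
print(months_until_visible(1, phase_offset=0.25, rca_done_at=0.5))   # 1.5
```

Under these assumptions, an eighteen-month cycle shows the improvement after eighteen months at best, while a one-month cycle shows it within two.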

The third issue with item six is "what to do with the result" of the RCA. For major catastrophes, finding the scapegoat is often the real reason behind RCA, as lawyers and victims get in line for compensation. In software, we are looking to prevent future occurrences, which means we need to change something: process, development environment, work environment, etc. Change means some effort or cost will be incurred to make the change. If this cost is not budgeted, how will it happen? All too often this falls into the "now a miracle occurs" part of the plan--or lack of plan--for preventive action. Whether the preventive action is as simple as a checklist update or as complex as changing the development environment or process, allocating some budget for this endeavor is mandatory. Telling your development teams to do something in zero time or at zero cost sends a message that the activity isn't worth much.

Side Note: Typically we use the terms error, fault, defect, and failure in sequence to correctly describe what happens. A person makes an error, which introduces a fault in the product. This fault results in a defect when the program executes, which may or may not result in a failure visible to the user. Casual writing quite often uses the terms bug or defect to mean all four. Usually context will tell you which is which, and, hopefully, I was consistent in this article.
Summary
To effectively apply RCA in your organization:
  • Make RCA a formal, budgeted activity
  • Avoid scapegoating
  • Involve the people who made the original error
  • Do RCA on groups of failures, looking for common causes
  • Understand both temporal delay and causality (linkage of cause to significance of the failure)


About the Author
Ed Weller is an SEI certified High Maturity Appraiser for CMMI® appraisals, with nearly forty years of experience in hardware and software engineering. Ed is the principal of Integrated Productivity Solutions, a consulting firm that is focused on providing solutions to companies seeking to improve their development productivity. Ed is a regular columnist on StickyMinds.com and can be contacted at efweller@aol.com.

 

StickyMinds.com Weekly Column From 1/28/2008 

Member Comments
 
Comment:    
by Frank Gorham-Engard 1/29/2008

I have tried some informal RCA in my present company, which is not even up to CMMI level 2. The most troubling and expensive problems trace back not to errors but to guesses made out of necessity in order to proceed. When we can trace back to an error, it is rarely a mistake where the developer knew better but was lax, clumsy, or careless, lacking discipline. It is almost always the lack of some piece of the vast amount of information that each developer must know in order to avoid mistakes. A language feature used but not completely understood, a specification changed or incomplete with no notification, the pressure of managers without the...

Author's Response:
1/30/2008    
Frank,

You have hit on another reason why Causal Analysis and Resolution is a Level 5 process area in the CMMI - most organizations can address the serious problems they have by better project management and requirements/design methods - Level 2 and 3 activities - (that fit the product environment). Less formal Lessons Learned or retrospectives can identify these process problems - now the org has to do something about them, else history repeats.

Rather than addressing problems one by one, looking at the way the organization manages its work can eliminate problems wholesale. For instance, there is overwhelming evidence that compressed schedules and more people lead to higher defect rates. Rational staffing and scheduling will reduce defects wholesale, without the need to analyze multiple defects to find common causes.

Ed

 
 
Comment:    
by Mark Crowther 1/29/2008

Hiya Ed, It's great to see this article and thanks for taking the time to write it. As an ex-manufacturing QA now in Software Test, the lack of RCA and Corrective Action Planning in the software testing and development domain stuns me. These two activities provide a measurable ROI for the business from its test activities, really hit the Cost of Quality and affect the bottom line of the business in real terms.

With RCA I've found the test team usually identifies a proximate cause but it needs collaboration from the development team to analyse the collection of issues and find the root cause. Which is a great way to get them...

Author's Response:
1/29/2008    
Mark,

Absolutely right--the person who made the error must be involved since we are talking about "intellectual" errors. Otherwise any root cause is merely conjecture or guessing. Sometimes these guesses are accurate, but one aspect of analysis is that the originator has to really think about what they did, and it makes the corrective action "stick".

Ed

 
 
Comment:    
by Sanat Sharma 1/29/2008

I personally appreciate all the 5 points mentioned in the summary section of this article. RCA is definitely a brainstorming exercise but sometimes, I have seen that RCA meetings become a blame storming exercise. We should be clear about how RCA should be effectively applied to any organization and this article is giving a good picture of that.

-- Sanat Sharma

Author's Response:
1/29/2008    
Sanat,

If the group devolves into blamestorming, they (and possibly the organization) are not ready for RCA. Perhaps this is one reason Watts Humphrey put Defect Prevention at Level 5 in the CMM? Ultimately, management sets the tone for RCA success. I saw one company demote a developer as the result of one of the first RCA reports. A standard process, that no one followed including the developer who was unfortunately the scapegoat, was the cause. Guess what happened to that initiative?

Ed

 
 
Comment:    
by Robert Rose-Coutre 1/28/2008

I would add a point to your summary: "Include a prevention component." Unfortunately row number 2 in your table sometimes leads QA or Development managers to ignore the root cause. Taking your example, if a noisy office one day leads to a programmer committing a typo in the code, you might say it's an isolated incident and take no further steps for future prevention. I think root cause analysis should always have a final prevention component, or else it's not really a legitimate root cause analysis. The final prevention component should be a formal step that requires a strong argument for "taking no further steps" and require at least a...

Author's Response:
1/28/2008    
Robert,

You have hit on a "sticky point" for me--at one time I refused to lead any more Lessons Learned in one org I was in because there was never funding for the corrective action or carry-over to other projects or parts of the organization.

I did not mention the CMMI in this article (for a change) but one component of Causal Analysis and Resolution at Level 5 is exactly what you are mentioning - there needs to be coordination across parts of the org and multiple RCAs to identify frequent small problems that in total are big problems.

Ed

 
 
Comment:    
by Srinivasan Desikan 1/28/2008

I had an opportunity to go through the books on the "Toyota Way" and visited their NUMMI factory to understand their processes. One thing that makes Toyota much different from others is "finding the root cause and fixing the problem at its source" rather than "finding and fixing each problem instance". I appreciate the author for the content and timing of this post on RCA. Thanks

 

© 2010 StickyMinds.com. All rights reserved.
StickyMinds.com is a division of Software Quality Engineering.