What Do You Do when a Showstopper Escapes into Production?

Member Submitted
You are the QA manager of a company developing an enterprise application. Last week, your team released a product version, including features requested by new customers. But, it included a showstopper that already has affected about a third of customer installations. What would you and your team do in this situation?

Here’s a real life scenario. Let’s see how many of you can relate to it:

It is the middle of a regular Tuesday afternoon. You are the QA manager of a company developing an enterprise application, and last week your team released a minor version that included the fixes from all the patches of the past two months plus four minor features that were requested by product marketing in order to “close some pretty important deals.”

All of a sudden, the phone rings. It is your R&D director “inviting” you to an urgent meeting in his room. You arrive to find the R&D director, together with his development team managers, the product marketing manager, and the support team leader in charge of your product, who is standing next to the whiteboard…

As you sit down, the support team leader tells all of you about an urgent showstopper that was released in last week’s version and that is expected to affect about a third of the companies who install this upgrade.

I want to stop this scenario (that happened to me about seven years ago) to ask you a simple question: What would you and your team do in this situation?

I believe that no two teams would react in the same way, and I don’t want to come up with best- and worst-case scenarios, but here are two contrasting approaches to serve as points on the continuum of possible behavior.

Scenario No. 1: Blame and Panic
Step 1
The meeting turns into a witch hunt, where development blames testing for not finding the bug, then testing blames development for not documenting all the changes made in the system, then development and testing blame product marketing for pushing the teams to release even though not all the tests had been completed, etc.

Step 2
After the meeting dissolves without any clear action items, support starts telling customers not to install the new release, the programmers start working on a solution without fully understanding the problem, and you are left on the side wondering how you missed this bug and trying to find the person who should be responsible for it.

Step 3
Since the developers think this is absolutely urgent, they decide to send the fix directly to support and, only in parallel, to your team for validation and verification. They do this at 8:30 p.m., when no one is left in the office, so you can only start testing it at 8:30 a.m. the next day. About thirty minutes into your testing cycle, you start finding bugs in their new version. The problem is that your support team has already started delivering this fix to the first customers who complained about the bug.

Step 4
By midday, your team finishes the tests on the system. They find that the initial fix only works on about half the supported configurations and, more importantly, it also causes a regression bug in an area not directly related to the fix.

Within thirty minutes, you get a new version that is verified and released to the support team by 4 p.m.

Step 5
Customer support wants to kill both your testing team and the developers, because now they need to find every company that downloaded the first fix and call to ask them not to install it or, even worse, to install yet another fix on top of it.

Product marketing is also mad at you, since they have already started getting calls from customers, account managers, and even some of your company’s top executives complaining about the mess and bad publicity this fiasco is creating in the field. They let you know that, as a result, your company will need to offer large discounts to all customers that complain, and they think this issue may cause a number of important deals to be delayed or lost.

Step 6
Step into your time machine and fast-forward three to four months. Everything is still the same. No one was fired after the fiasco, but the atmosphere was tense for about one week. After that, it became water under the bridge.

Your team is about to release another minor version, including all the patches of the past months plus another three features needed for important deals.

As always, product marketing is pushing for the release to go out on schedule even though your team got the final build a week late and you learned only yesterday that they included another feature that you were not even aware of.

As they say, "Nothing changes, if nothing changes."

Scenario No. 2: Solve, Learn, and Improve
Step 0: Don’t Panic!

The last thing you want to do is to start looking for someone to blame. Chances are good that more than one person is “to blame” for the mistakes that led to the issue being released. It is almost certain that no one did this on purpose and, most importantly, blame will not help you to solve the issue!

So, try to resist your basic instinct to blame someone else for the issue. If another member of the team starts the blame game, immediately ask how this is contributing to solving the issue any faster. You can also state that there will be enough time after the issue is solved to understand what went wrong and why.

Step 1: Fix and Test
OK, so there is a bug out there and you had better fix it quickly! Put together the best team you can that will: analyze the issue; define the quickest, safest and most effective fix; create this fix; and, finally, recommend how to test it.

Which testing approach to take is not a simple decision, either. Depending on the bug and the product you are working on, you may choose to deliver the fix directly without doing any tests and verify only after the issue has been solved. (For example, if your application is web-based and the servers are down, it is better to get them back up and test once they are up.) On other occasions, you may choose to run some or all possible tests in-house before releasing the fix. This is usually the case if the bug is not really critical and the fix may cause bigger problems, such as data loss or business disruption.

In short, the first step is to create the best solution and find the most appropriate approach to get it to your customers.

Step 2: Analyze
Once the critical part is over and the issue has been solved, but before people “move on” with their lives and tasks, make sure you understand what went wrong. The analysis process should never be a witch hunt. It should be an opportunity where everybody feels safe to collaborate and bring forward all the factors, both internal and external, that contributed to the problem happening in the first place. This activity is fairly common and is called a retrospective or post-mortem.

Step 3: Corrective Actions
Once the factors and issues have been identified as part of the post-mortem, the next step is to define corrective actions to prevent this from happening again. Make sure these actions are clearly defined and actionable (duh!). Many times, we see actions like “Make sure communication is better,” but this is not actionable at all and will not help anybody to change the way they have communicated up to now. So, as dumb as it may sound, make sure your corrective actions are actionable, and they will lead to a change in the way things have been done in your company up to now.

Strong Teams Share Their Failures, while Weak Teams Sweep Them Under the Rug
I remember coming up with a phrase a couple of years ago as part of a presentation I did for a group of testers: “When you are a true professional, you see failure as an opportunity to become a better professional.” I have seen many development teams that perform retrospectives but then are afraid or ashamed to share their results with the rest of the company. Would you blame the team for being self-centered and egotistical? I would actually blame their companies for not making sure their work atmosphere encourages teams to take risks and learn from their failures.

You need to make sure your team—and preferably your company—encourages the sharing of retrospectives and corrective actions. It is one of the best sources of free advice available. Also, if you look closely, teams with the confidence and maturity required to openly share their risks and failures are also the most fun and challenging teams to work in.

And, if all this was not enough to convince you, just go ahead and read the Dilbert Comic Strip published recently on the subject.

How would your team react? Do you have any war stories or insights into good or bad reactions? Share them in the comments.


StickyMinds is a TechWell community.
