A number of years ago, my organization had problems with consistently tracking and measuring system outages. Issues were not being logged, admins were making changes to systems without going through change management, and many of the same problems kept recurring.
As part of our remediation process, we implemented a performance measurement process to measure system reliability.
Categorizing Applications and Systems into Criticality Tiers
First, we organized our systems into criticality tiers, assigning a criticality level to all systems based on the impact to our business:
- Mission-critical: The system is critical to the support of our end-customers. I work for a utility company, so these are systems that enable us to provide safe, reliable generation and distribution of electricity to our customers.
- Business-critical: The system is critical to company operations. These are systems related to purchasing, payroll, and other business systems where a brief interruption of service can be tolerated.
- Critical: These are operational systems where an outage is not as critical but it is still important. The application we use for tracking testing and defects for applications not in production is a good example. We can tolerate an outage for a period of time but cannot go for an extended period without this capability.
- Not critical: Just as the name implies, these are systems we support where outages can be tolerated. Legacy applications that are needed to store data for past activity but are not being used in day-to-day operations are a great example.
Service-level agreements (SLAs) surrounding reliability were defined by criticality tier to further simplify our measurement process. We now review all systems annually to ensure our systems are assigned to the correct criticality tier. Our system-hardening efforts (for redundancy, load balancing, etc.) also reflect this criticality perspective in that mission-critical systems now have a higher level of failover and load balancing as part of their design.
Discovering the Root Cause of Failures
Second, when a user contacts the service desk (SD) to report an issue, the SD team evaluates the issue to determine whether it is a single-user issue or a larger system outage. We have all seen a case where a PC stops working and to the user it appears to be a system issue, when in fact it is resolved by a PC reboot or verifying that the PC is connected to the LAN.
When a system-level outage is recognized, the IT management team and support groups are alerted to the issue by both email and IM. Once the issue is resolved and access is restored, the amount of downtime is measured. We also perform a root-cause analysis (RCA) for each outage of a mission- or business-critical system.
An RCA is not a blame session; things happen and mistakes occur. Our teams are very qualified and talented, and we remain focused on the event itself instead of the people who may have made a mistake. This allows our teams to be honest and supply all the relevant information to those correcting the problem.
Instead, an RCA is about determining what happened and when—this includes doing an investigation and breaking through the symptoms to find the underlying root cause of the issue—as well as how we can prevent a recurrence or lessen the impact of a repeat event. By building a knowledge base of problems and solutions, we can avoid issues, or at least respond and resolve them much faster when a problem comes around a second or third time.
RCA investigations are kept open until the mitigation actions are in place and the issue can truly be considered resolved. The goal is continuous improvement, with the recognition that things can go wrong. We also keep our RCA process simple and consistent so that it is easy to follow and repeat.
Calculating Uptime and Downtime
Third, we simplified our uptime calculation to make it easier to track. For each criticality tier, we take the count of systems and multiply it by 24 hours and 365 days to get the total hours for that tier over a full year. We calculate and report uptime for both the most recent month and year to date. As outages occur in a given month, we record the duration of each outage and calculate uptime as follows:
(Total hours - Outage hours) / Total hours × 100 = % Uptime
A recognized limitation of our method is that the uptime calculation does not account for planned outages associated with upgrades and maintenance. Factoring planned downtime into the math is complicated, and we have chosen to err on the side of simplicity. The result is a higher-than-actual uptime metric when planned downtime occurs, but the metric is easier to calculate and track.
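As a worked illustration of the formula above, here is a minimal Python sketch of the calculation for a single tier. The function name, the system count, and the outage figure are hypothetical and chosen only for the example. Passing the period length as a parameter is also an assumption, made so that the same function can cover a month or year to date.

```python
def uptime_percent(system_count, period_hours, outage_hours):
    """Return uptime for one criticality tier as a percentage.

    Total hours = number of systems x hours in the reporting period.
    Planned outages are simply not passed in, mirroring the
    simplification described above.
    """
    total_hours = system_count * period_hours
    if total_hours <= 0:
        raise ValueError("total hours must be positive")
    if not 0 <= outage_hours <= total_hours:
        raise ValueError("outage hours must be between 0 and total hours")
    return (total_hours - outage_hours) / total_hours * 100


# Hypothetical tier: 10 systems over a full year (24 x 365 hours each),
# with 43.8 hours of unplanned outage logged across the whole tier.
annual = uptime_percent(system_count=10, period_hours=24 * 365, outage_hours=43.8)
print(f"{annual:.2f}%")  # prints 99.95%
```

Because the total is pooled across every system in the tier, one long outage on a single system is diluted by the healthy hours of the others. That keeps the tier-level number simple, at the cost of hiding which system caused the downtime.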
While our performance measurement process has been effective, members of the IT group have questioned some of our calculations. Here are the common questions we get about why we do things the way we do:
- Why do outages that don’t impact users get counted as downtime? For instance, if a patch causes a system to fail on Friday night, users don’t use the system until Monday, and it gets fixed on Sunday, why is the downtime between Friday and Sunday logged?
Answer: We get credit for system uptime when planned outages occur, so when a system is down for an unplanned event—irrespective of whether a user is impacted—we feel we need to include this actual outage duration in our uptime calculations.
- If an outage is fixed but the incident ticket is not updated in a timely manner, why is downtime calculated until the ticket is updated? In this situation, the issue was fixed but the technician or other party neglected to update the ticket used to calculate outage time.
Answer: We have a system of record for outages, and to ensure accuracy, we base downtime on when the ticket is updated after the issue is resolved. If there is evidence that the issue was resolved before the time on the ticket (such as an email from the user confirming that the problem was solved), we can factor that into the calculation. But we do push to have the incident details be our data source.
- If a vendor cuts the fiber cable and causes the phone lines to be down, why does this count against my uptime? The phones were up, just not able to take or make calls. In telecom, we measure uptime at multiple locations with different levels of criticality based on the operations at that location.
Answer: By measuring this downtime, we can see which locations would benefit from a technology upgrade, and this outage data is important to justify capital requests.
- A portion of the data center had an issue, causing a number of applications to not be available. Why should the applications and the hardware be counted as down?
Answer: The user does not know why their application is not available, just that it isn’t. We need to reflect the user’s perspective in our uptime calculation if we expect our users to trust us. In addition, looking at the true impact of an outage can help justify further system reliability investments.
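The first two answers above imply a simple rule for outage duration: take the timestamps from the incident ticket, count the whole span whether or not users were affected, and accept an earlier resolution time only when there is evidence for it. The Python sketch below shows one way to express that rule. The function name, parameter names, and dates are hypothetical and do not come from any particular ticketing system.

```python
from datetime import datetime


def outage_hours(opened, ticket_resolved, evidenced_resolved=None):
    """Hours of downtime for one incident.

    The ticket is the system of record. An earlier, evidenced
    resolution time (for example, a user's email) may replace the
    ticket's resolution time.
    """
    resolved = ticket_resolved
    if evidenced_resolved is not None and evidenced_resolved < ticket_resolved:
        resolved = evidenced_resolved
    if resolved < opened:
        raise ValueError("resolution time precedes outage start")
    return (resolved - opened).total_seconds() / 3600


# Hypothetical incident: a patch fails Friday 22:00, the fix lands
# Sunday 10:00, but the ticket is not updated until Monday 08:00.
start = datetime(2024, 3, 1, 22, 0)    # Friday
fixed = datetime(2024, 3, 3, 10, 0)    # Sunday
ticketed = datetime(2024, 3, 4, 8, 0)  # Monday

print(outage_hours(start, ticketed))         # 58.0, ticket time only
print(outage_hours(start, ticketed, fixed))  # 36.0, with evidence of the earlier fix
```

In this example the late ticket update adds 22 hours of recorded downtime, which is the practical reason to push for prompt ticket updates.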
Goals for uptime are set at the start of each year and are part of our compensation model. This is an effective way to help ensure our teams are focused on what is most impactful to our customers.
We publish a monthly uptime report that contains the uptime performance dashboard along with related graphs and tables that help communicate the information hidden in the data. We also compile an annual report whose goal is to identify the systems with the most issues, measured by both the number of outages and the total time unavailable. This makes the bigger issues easy to see.
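The annual ranking described above can be produced from a plain outage log. The Python sketch below is illustrative only: the system names, the hours, and the choice to sort by total hours first and outage count second are all assumptions.

```python
from collections import defaultdict

# Hypothetical outage log for the year: (system name, outage hours).
outages = [
    ("billing", 2.5),
    ("scada-gateway", 0.5),
    ("billing", 4.0),
    ("email", 1.0),
    ("scada-gateway", 0.25),
    ("billing", 1.5),
]


def summarize(outage_log):
    """Return (system, outage count, total hours) rows, worst first."""
    totals = defaultdict(lambda: [0, 0.0])
    for system, hours in outage_log:
        totals[system][0] += 1      # number of outages
        totals[system][1] += hours  # total time unavailable
    # Sort by total hours unavailable, then by outage count, both descending.
    return sorted(
        ((name, count, hours) for name, (count, hours) in totals.items()),
        key=lambda row: (row[2], row[1]),
        reverse=True,
    )


report = summarize(outages)
for name, count, hours in report:
    print(f"{name:15} outages={count} hours={hours:.2f}")
```

Keeping both columns matters because the two measures can disagree. In this sample data, the gateway fails more often than email but is unavailable for less total time.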
Pushing for Consistency
Our performance measurement process is a simple but effective approach for providing visibility and sharing information. We have been able to drive down unplanned downtime by more than 80 percent.
While there was initial hesitation about using this approach, it is now an accepted part of our business. The operations teams rely on this data, and leadership uses it to monitor system performance. If you’re experiencing system inconsistency or recurring performance issues, or you just want to make your operations better and more reliable, consider implementing a process to measure performance. There may be some pushback at first, but if you stick with it, the visibility into your operations will become invaluable.