Where were you on 31 December 1999? Celebrating on a beach in New Zealand, one of the first places in the world to experience the new millennium? Or in your office, hastily testing changes to old computer systems (with all vacation requests denied for a period of six weeks before and after New Year's Eve)? After all, the cover of BusinessWeek magazine had used the term "Global Financial Meltdown" to describe the concern. With hindsight, what have we learned about testing and quality assurance from the year 2000 problem (also called the Y2K or Millennium Bug)? For the last two years nobody has wanted to talk about it-but it should now be safe to bring up the topic.
As a consultant, I found the Y2K scare to be a profitable venture. Fear is a great motivator, and organizations that for years had under-invested in testing suddenly decided that money was no object. (Fortunately, I did not publicly say anything too stupid about the end of civilization and the cockroaches inheriting the earth, so I don't have any embarrassing statements to retract.)
In 1997, the University of California, Berkeley asked me to make a series of television programs about Y2K for their executive education network. (This is a subscription-only network run by Berkeley.) The university had great expectations, but to their disappointment the series was a bust-very few CEOs, CFOs, or CIOs tuned in to watch.
At the time, I thought the problem was that we were too late with the message. Presumably, these forward-looking executives were already on top of the issue and did not need to know anything more. Over the next several months, the truth dawned that actually we were way too early. In 1997, Y2K was not yet on many radar screens.
The perspective of internal employees-who had to do all the work to repair, replace, and test systems, instead of merely talking about it-tended to be different from that of outsiders. Often these people were "volunteered" for projects that were not technically innovative, forcing them to miss new opportunities (such as learning about the Web) while sprucing up skills in old technologies that quickly became obsolete after Y2K; furthermore, they were subjected to intense deadline pressures and anxious managers looking over their shoulders. But they were promised the undying gratitude of their organizations for staving off potential disaster.
Background to the Problem
The year 2000 problem started many years ago, when data storage in computer systems was scarce and expensive. To conserve data storage, software engineers adopted the practice of representing dates with two digits for the year instead of four digits. For example, the year 1977 was represented by "77" in storage and in all date-based comparisons and computations.
The requirement to use only two digits for the year was even mandated by a U.S. government standard (FIPS), which was intended to be followed by all government agencies.
The unstated assumption was that all these systems would be obsolete and replaced well before the year 2000 arrived.
As a result of the two-digit years, on January 1, 2000, computer hardware, operating systems, databases, and application systems that had not been updated would effectively reset their internal clocks to January 1, 1900, or some other incorrect date, or refuse to operate.
In other words, without costly, risky, and time-consuming renovation or replacement, millions of existing computer systems were destined to fail at the end of the last millennium.
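The failure mode is easy to see in miniature. The following sketch (hypothetical, not from any actual legacy system) stores years as two characters, the way many old systems did, and shows how the arithmetic silently breaks at the century boundary:

```python
# A minimal sketch of the classic two-digit-year bug.
# Years are stored as two-character fields, as many legacy systems did.

def years_elapsed(start_yy: str, end_yy: str) -> int:
    """Compute elapsed years from two-digit year fields."""
    return int(end_yy) - int(start_yy)

# Within the same century the arithmetic works:
print(years_elapsed("77", "99"))   # 22 -- correct

# But at the century rollover, "00" is numerically less than "77":
print(years_elapsed("77", "00"))   # -77 -- an account opened in 1977
                                   # now appears to predate its own opening
```

A system built on this representation might conclude that maintenance is 99 years overdue, that a customer owes a century of interest, or simply refuse to run.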
Imagine the following "day in the life" scenario. We had a lot of fun videotaping this scenario as the three-minute introduction to the Berkeley TV series, using one of the camera crew as the actor:
You are enjoying a quiet day at home, but feeling (and looking) mildly hung-over after indulging in a somewhat hectic holiday celebration the night before. (The actor looked appropriately disheveled.) The mail arrives. You toss out the junk mail and open your credit card bill. The restaurant meal of $38.92 you charged three weeks ago is on the bill, plus a late charge of approximately $1.8 billion (at 18% annual interest, compounded monthly for 99 years).
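The late charge in the scenario is not a made-up number; it follows directly from ordinary compound-interest arithmetic once the billing system thinks the debt is 99 years old:

```python
# Sanity-checking the scenario's late charge: $38.92 at 18% annual
# interest, compounded monthly, accrued for a spurious 99 years.
principal = 38.92
monthly_rate = 0.18 / 12          # 1.5% per month
months = 99 * 12                  # 1188 compounding periods

total = principal * (1 + monthly_rate) ** months
print(f"${total:,.0f}")           # roughly $1.9 billion
```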
You pick up the telephone and cannot get a dial tone. Your telephone and Internet accounts have been canceled. According to their records, the communications utilities stopped trying to persuade you to pay your bills years ago and have written you off as a deadbeat. You decide to watch the Rose Bowl game on TV, but there is some kind of glitch with a communications satellite.
At least you can still receive postal mail. Your mail today includes a letter from your employer, saying that the office will be closed indefinitely until the organization resolves unexpected, emergency computer problems. Your employer will contact you when the offices are open again. No word about how (or if) you will be paid in the meantime.
The TV news has coverage of a nuclear power plant that has exploded. Something about problems with an automated system that controls routine maintenance at the nuclear plant-it became confused about the date and shut off the flow of coolant to the reactor. There is another story on the news about senior citizens begging for food in freezing weather, because their government pension checks have not arrived.
What nightmarish day is this anyway? Saturday, January 1, 2000.
The Real Problem
The Y2K fixes, while numerous, were in themselves straightforward and low risk, perhaps 1 or 2 on a scale of 1 to 10 of the difficulty of software fixes.
Various crackpots devised magic solutions-automated tools that they claimed could scan through hundreds of thousands of lines of existing source code per hour, identify Y2K problems and automatically repair them for a cost measured in pennies.
Why do I call the tool developers crackpots? Not because the tools failed-several of these solutions actually worked, though the results still required laborious manual double-checking. It's because making the repairs, while essential, was far from being the biggest problem. The more important issue was the unseen, unintended side effects of those repairs. In many organizations, the regression testing consumed the largest part of the budget.
The real problem was (and still is) the maintainability of the code. Code may have originally been written with little thought given to its maintainability; for example, the structure may not have been particularly modular and logical.
Old code often runs in an obsolete technical infrastructure (so that very few if any people can be found who understand it); has documentation that is obsolete, missing or incomprehensible; or has been patched over the years (with patches on top of patches). The code no longer has an architecture, but seems more like a murky blob of entangled spaghetti, where everything connects to everything else, so that it's highly likely that a simple change will cause a defect to propagate to unrelated parts of the system. And with ongoing turnover of the people who use the system and who are assigned to maintain it, nobody really understands how the old code works.
Software entropy says that the reliability of software degrades over time: after code reaches a certain age, each modification is likely to insert more new defects than it removes. If the original design was informal and maintenance practices have been casual, it usually takes software only a few years to reach this point.
Testing does not catch everything, especially when performed under hectic deadlines. Based on audits of the efficacy of Y2K fix projects, Capers Jones of Software Productivity Research estimates that for every one hundred Y2K fixes, seven new defects (7%) were introduced that still remain-they have not yet been found and removed, despite all the regression testing. Many of these Y2K-introduced defects will not be uncovered for years.
By 1998 or 1999, testers could not do much about the real Y2K problem-mediocre to poor software maintainability. But we may be able to influence future system design and maintenance practices. We know how to design maintainable systems and preserve maintainability as a system ages, but out of convenience or ignorance, these practices are often not followed.
Because the predicted firestorm of Y2K problems never happened, we may have been lulled into under-preparing for possible future disasters. Legal groups such as associations of trial attorneys, no doubt in eager anticipation of plentiful fees, had predicted the total Y2K litigation potential to be as high as $1 trillion (yes, that's with a "t"). The actual amount of Y2K damages paid probably did not exceed $1 million, according to the Wall Street Journal. This works out to less than a penny for every $10,000 that had been anticipated.
Depending on which news report you read, societies worldwide spent as much as $500 billion on fixing Y2K computer problems. Many experts believed that without costly and time-consuming renovations, millions of existing computer systems were destined to fail at the end of 1999. Prudent people hoarded months' worth of food, guns, survival gear, and good trashy novels. Doomsday prophets forecasted that the world would end at midnight, 31 December.
With hindsight, though, many people now believe that the vast majority of the expenditures for Y2K computer fixes were unnecessary and that Y2K was an anticlimax. But the post-Y2K complacency is misguided. The true story is that the event was an anticlimax precisely because the heavy expenditures helped to avoid numerous fiascoes.
If you consider the real-world consequences of not fixing the problems prior to Y2K, these preventive expenditures were minor. In the Y2K effort, the measure of success is what did not happen after the beginning of 2000.
What Could Have Gone Wrong?
The number and severity of Y2K problems that were found in test mode, where midnight on 12/31/99 was simulated in testing before it actually happened, more than justified the preventive expenditures. Examples of problems found in test mode (and fixed before Y2K occurred), according to various news reports, include
- In New York, elevators froze. The elevator software believed that the elevators had not been maintained for 99 years. They shut down, as they were designed to do when their maintenance was overdue.
- In Southern California, an office building locked up during the Y2K test and refused to accept employees' scan-card building passes. A fire emergency was declared, and the local fire service used fire axes to break open plate glass windows in order to provide an exit from the building. (This example may be an urban legend, but it was reported in the local press. How many test professionals can ever say that their testing has caused fire engines to rush to the site?)
- In Lubbock, Texas, all the cell doors in a state prison unlocked at the simulated time of midnight on 12/31/99. Party time for the long-deprived felons.
- Boeing reported that 750 of its airplanes were found to have Y2K problems, with more than 50 planes impacted to the point where they would not be able to fly.
- Police dispatch systems, credit card systems, emergency telephone (911) systems, and oil pipeline pumping systems, among others, also went awry during testing with simulated Y2K dates.
With hindsight, perhaps there was over-testing, but the consequences of under-testing would have been far worse.
Y2K was widely viewed as an anticlimax and the mitigation efforts as vast overkill. But the efforts were successful-an anticlimax is exactly what society wanted to achieve. What were the outcomes of the event? I make the following observations:
- Perhaps a more important payoff than the fixes themselves has been an inconspicuous byproduct. Billions of lines of source code have been cleaned up, including mundane tasks such as upgrading documentation. Untold new problems caused by introducing changes into old buggy code have been averted, and the total payoff could well dwarf the estimated expenditure of $500 billion.
- Another important byproduct has been the maturation and more widespread utilization of automated regression testing. Most test tool vendors and test consulting firms experienced a surge in demand and revenues that have led to faster product improvement than would have occurred otherwise.
- Fear is a great motivator. Testers who had struggled for years to gain commitment and support suddenly found the budget floodgates opened. Like a doctor saying that she needs a multimillion-dollar medical device or else the patient might die, smart testers tied everything on their pent-up wish lists to Y2K, such as the need for expensive automated test tools.
- Fixing problems is the tip of the iceberg; regression testing is the bulk. Organizations that undertook reasonably prudent steps found that the effort to retest after changes to old code was often three to five times bigger than the effort to make the changes.
Unfortunately, many things remain the same:
- The law of software entropy (some call it software rot) remains in effect. Reliability of software degrades over time, as changes introduce a gradual accumulation of unintended bugs. Entropy relentlessly continues.
- In my observation, there is only minor progress on the most important payoff we could have hoped for. Organizations have not faced the root causes or significantly changed the convenient bad practices that made Y2K so dangerous in the first place: entangled system designs, hasty changes, and inadequate regression testing. It's a little like someone going to a dentist for an intensive and painful deep cleaning, then lapsing back into lackadaisical dental hygiene practices.
- The problem has not gone away. Myriad date-related incidents lurk waiting for the right moment, such as the resetting of the Unix clocks in 2038. Everybody is confident that all the old Unix code will be replaced well before then. That's what we thought in 1990 about the Y2K bug.
Finally, some lessons learned:
- On any project where attorneys predominate, the test documentation seems to double and the project costs triple, and the testers' rates of reimbursement look exceedingly modest. In some organizations, every third member of the Y2K team seemed to be an attorney.
- Be careful what you predict. Some industry pundits became carried away in their pronouncements of gloom and doom and looked ridiculous after the fact.
- Don't expect the press-or technical professionals-to get the story right. We can't predict where the bugs will be or precisely how much testing is enough. Given these uncertainties and the acknowledged huge degree of risk, the effort that was viewed as overkill turned out to be a prudent, justifiable response.
- Don't expect to be rewarded with the key to the executive washroom. In a survey conducted in mid-2000 by Howard Rubin, seventy percent of the information systems professionals who were significantly involved in Y2K projects were dissatisfied with their rewards. They felt that the bonuses, retraining, and new positions promised by their organizations in return for their Y2K efforts had not materialized. A mere two percent thought their rewards were above expectations. Says Rubin, "In general, Y2K folks have gotten few rewards. While digging in on Y2K, many missed the e-Business boat. And because there was no Y2K bang, industry executives gave them no bucks. It's sort of a paradox because they should have gotten lots of bucks for no bang."
Now that Y2K is well behind us, we can forget about all this date nonsense until the year 9999 (when the Y10K panic will hit, and we will have to convert all the COBOL code yet again)-right?
Unfortunately, we will have opportunities to apply the methods of Y2K repair and testing well before the year 10000. In the United States, the set of possible telephone numbers probably will be completely allocated sometime in the first decade of the 21st century. The U.S. will have to convert from the present ten-digit system, which will likely be painful and full of errors. Some time within the first two or three decades of the 21st century, the U.S. will also run out of Social Security numbers.
Both the telephone numbers and Social Security numbers are pervasive, which means that major data conversion efforts will be needed-similar in the types of activity and possibly similar in size to the Y2K effort.
There are more lurking date problems too, such as the resetting of the Unix clocks in 2038. The Unix problem does not affect 64-bit systems, at least not for a very long time, and some people are banking on not having the older 32-bit Unix systems around when the date rollover happens. Back in the 1980s and early 1990s, though, people said that there would be no Y2K problem because all the old systems would be replaced long before 2000. Some hapless Unix system administrators will no doubt be feverishly installing fixes minutes before the clocks roll over in January 2038.
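The 2038 rollover can be demonstrated without waiting for it. A 32-bit signed `time_t` counts seconds since 1 January 1970 UTC and tops out at 2^31 - 1, which corresponds to 03:14:07 UTC on 19 January 2038; one second later the value wraps to the most negative 32-bit integer, which reads as a date in 1901. The sketch below mimics that truncation:

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_32BIT = 2**31 - 1  # 2147483647, the largest 32-bit signed time_t

def as_32bit_time(seconds: int) -> datetime:
    # Truncate to 32 bits, then reinterpret as a signed value,
    # mimicking what a 32-bit signed time_t actually stores.
    wrapped, = struct.unpack("<i", struct.pack("<I", seconds & 0xFFFFFFFF))
    return EPOCH + timedelta(seconds=wrapped)

print(as_32bit_time(MAX_32BIT))      # 2038-01-19 03:14:07+00:00
print(as_32bit_time(MAX_32BIT + 1))  # wraps to 1901-12-13 20:45:52+00:00
```

Like the two-digit year, the 32-bit timestamp was a storage economy whose expiration date seemed comfortably far away when it was adopted.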
Perhaps the most significant thing about Y2K is not that it was perceived to be a crisis and that it was managed in an acceptable manner. There have been many crises in human history. But Y2K was the first worldwide crisis that was software-driven. There undoubtedly will be more.