Improving the Accuracy of Tests by Weighing the Results

[article]
Summary:

Test automation is often hard to do well, especially when testing a complex system where the results may be dynamic and hard to predict precisely. For instance, how would you test a price finder for highly volatile products such as airline tickets, where the data comes from live, third-party systems? While manual testers can interpret such results to decide whether a system is working correctly, the interpretation may be much harder to codify in an automated test. In this week's column, Julian Harty describes an approach that may help you to improve the accuracy of your automated testing even in these volatile environments.

Heuristic Test Oracles
One of the biggest challenges when automating tests is deciding whether the results are good or bad. We could compare against a reference, commonly called a test oracle; however, complete test oracles for complex systems may be extremely expensive and are likely to have problems of their own. Doug Hoffman wrote a great article on heuristic test oracles in the March/April 1999 issue of STQE magazine. These are relatively simple, albeit imperfect, models of the behaviors we expect from a system.

Let's see if we can create a heuristic test oracle that can cope with complex, volatile environments. The oracle will need to be accurate enough for automated tests to score results and determine whether they fall into one of three categories: good, bad, or unknown.

Weighing Responses to Tests

In software, as in life, there are things we notice that help confirm whether something is satisfactory or unsatisfactory. Let's call these things that affect our judgment of the results "factors." Some factors provide stronger indications than others, so when we use factors to rate results we assign higher scores (or "weightings") to the stronger indicators, giving them a stronger influence on the overall outcome.

Some factors are positive, and when we detect them we are more likely to view the result as satisfactory. Others are negative and more likely to lead us to view the result as unsatisfactory.

As a real-life example of how factors affect our judgment, let's examine the way we judge a meal at a restaurant. Positive indicators might be the décor (a minor factor), the service (a medium factor), the taste of the meal (a strong factor), etc. Negative factors might be dirty crockery or, worse, an insect in the food.

For the airline ticket scenario mentioned in the summary, positive factors might include structured HTML results with a section heading titled Flights (a minor factor), a Web form such as bookNow (a medium factor), and well-formatted results for the locations and dates requested (a strong factor). Negative indicators might be the lack of the bookNow Web form (a medium factor) or HTTP error codes like the dreaded "HTTP 500 server error" (a major factor). Sometimes a negative factor, such as the server error in this example, may override any or all of the positive indicators. Some factors may be structural, such as the HTML elements, and others may relate to data, such as the details of the flights.
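To make this concrete, here is a minimal sketch, in Python (the article prescribes no particular language), of how such factors might be detected in a raw HTTP response. The function name, the markup patterns, and the bookNow form attribute are illustrative assumptions, not a real site's structure:

import re

def detect_factors(status_code, body):
    """Detect each indicator in a raw HTTP response.

    Returns a dict mapping factor name to True/False. The markup
    patterns below are illustrative guesses, not a real site's
    actual structure.
    """
    return {
        # Structural, minor: a section heading titled "Flights"
        "flights_heading": bool(re.search(r"<h\d[^>]*>\s*Flights", body)),
        # Structural, medium: the bookNow Web form
        "book_now_form": 'name="bookNow"' in body,
        # Data, strong: at least one row that looks like a pair of flight times
        "flight_rows": bool(re.search(r"\d{2}:\d{2}.*\d{2}:\d{2}", body)),
        # Major negative: an HTTP error status
        "http_error": status_code in (404, 500),
    }

Keeping detection separate from scoring means we can tune the weightings later without touching the checks themselves.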

For test automation, we need to assign scores to the factors so we can codify them. A good way to start is to use a table (see Table 1 below).

Table 1: Factors and Assigned Weightings

Indicator                  | Weighting | Range                                    | Comments
---------------------------|-----------|------------------------------------------|----------------------------------------------------------------
HTML heading: Flights      | Low       | 0 if missing, +1 if present              |
bookNow HTML form          | Medium    | -1 if missing, +2 if present             |
Well-formatted flight data | High      | 0 if missing, +5 if >= 1 result          | We may want to consider having an upper limit, based on business rules
HTTP error codes           | High      | -10 for either a 404 or a 500 error code |
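Continuing the hypothetical Python sketch above, the table translates almost directly into a lookup of weighted contributions. The numbers mirror Table 1 and are starting points to be tuned, not final values:

# Weightings from Table 1: (score if factor absent, score if factor present).
WEIGHTS = {
    "flights_heading": (0, 1),
    "book_now_form": (-1, 2),
    "flight_rows": (0, 5),
    "http_error": (0, -10),  # the presence of an error is heavily penalized
}

def score_factors(factors):
    """Sum the weighted contribution of each detected factor."""
    # A Python bool indexes the tuple: False -> 0 (absent), True -> 1 (present).
    return sum(WEIGHTS[name][present] for name, present in factors.items())

Because a Python bool is also an int, each factor's presence simply selects which of its two scores applies.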

The ranges should be set so that an overall negative score corresponds to unacceptable results and an overall positive score indicates acceptable results. A good way to assign weightings is to ask someone to describe the factors that help them decide whether a result is valid or trustworthy. We can also create mock-ups of possible outcomes and see how people react. For instance, ask, "How would you feel if the flight information was written as one long string?"
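The thresholds that turn an overall score into one of the three verdicts (good, bad, or unknown) are another judgment call. The cut-off values in this sketch are assumptions that should be calibrated against results a person has already judged by hand:

def classify(score, bad_below=0, good_from=5):
    """Map an overall score to "good", "bad", or "unknown".

    Scores in the middle band are deliberately reported as unknown so
    a person can review borderline results rather than having the test
    guess.
    """
    if score < bad_below:
        return "bad"
    if score >= good_from:
        return "good"
    return "unknown"

An end-to-end check then becomes classify(score_factors(detect_factors(status, body))), with "unknown" results routed to a person for review.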

When we construct

About the author

Julian Harty

A senior test engineer at Google, Julian Harty finds ways to test lots of fun products, including the mobile wireless software used by millions of users worldwide. He's been involved in software and online systems for more than twenty years and enjoys working with others to find ways to solve testing challenges productively. A presenter at both STAREAST and STARWEST, Julian has been involved in international conferences and workshops on software testing.
