A StickyMinds.com Original
Improving the Accuracy of Tests by Weighing the Results

By Julian Harty


Summary: Test automation is often hard to do well, especially when testing a complex system where the results may be dynamic and hard to predict precisely. For instance, how would you test a price finder for highly volatile products such as airline tickets, where the data comes from live, third-party systems? While manual testers can interpret such results to decide whether a system is working correctly, the interpretation may be much harder to codify in an automated test. In this week's column, Julian Harty describes an approach that may help you to improve the accuracy of your automated testing even in these volatile environments.


Heuristic Test Oracles
One of the biggest challenges when automating tests is deciding whether the results are good or bad. We could compare against a reference, commonly called a test oracle; however, complete test oracles for complex systems may be extremely expensive and are likely to have their own problems. Doug Hoffman wrote a great article on heuristic test oracles in the March/April 1999 issue of STQE magazine. Heuristic test oracles are relatively simple, albeit imperfect, models of the behaviors we expect from a system.

Let's see if we can create a heuristic test oracle that is able to cope in complex, volatile environments. The oracle will need to be sufficiently accurate to enable automated tests to score results and determine if the results fall into one of three categories: good, bad, or unknown.

Weighing Responses to Tests

In software as in life there are things we notice that help confirm whether something is satisfactory or unsatisfactory. Let's call these things that affect our judgment of the results "factors." Some factors provide stronger indications than others. When using these factors to rate results, we will assign higher scores (or "weightings") to the stronger indicators. By assigning higher weightings to the stronger indicators, we enable these to have a stronger influence on the overall outcome.

Some factors are positive, and when we detect them we are more likely to view the result as satisfactory. Others are negative and more likely to lead us to view the result as unsatisfactory.

As a real-life example of how factors affect our judgment, let's examine the way we judge a meal at a restaurant. Positive indicators might be the décor (a minor factor), the service (a medium factor), the taste of the meal (a strong factor), etc. Negative factors might be dirty crockery or, worse, an insect in the food.

For the airline ticket scenario mentioned in the summary, positive factors might include: structured HTML results with a section heading titled Flights (a minor factor), a Web form such as bookNow (a medium factor), and well-formatted results for the locations and dates requested (a strong factor). Negative indicators might be the lack of the Web form bookNow (a medium factor), or HTTP error codes like the dreaded "HTTP 500 server error" (a major factor). Sometimes the negative factor may override any or all of the positive indicators, such as the server error in this example. Some factors may be structural, such as the HTML elements, and others may be related to data, such as the details of the flights.

For test automation, we need to assign scores to the factors so we can codify them. A good way to start is to use a table (see table 1 below).

Table 1: Factors and Assigned Weightings

Indicator                  | Weighting | Range                                | Comments
---------------------------|-----------|--------------------------------------|---------
HTML heading: Flights      | Low       | 0 if missing, 1 if present           |
bookNow HTML form          | Medium    | -1 if missing, +2 if present         |
Well-formatted flight data | High      | 0 if missing, +5 if >= 1 result      | We may want to consider having an upper limit, based on business rules
HTTP error codes           | High      | -10 for either 404 or 500 error codes |



The ranges should be set so an overall negative score corresponds to unacceptable results and an overall positive score indicates acceptable results. A good way to assign weightings is to ask someone to describe the factors that help them decide whether a result is valid or trustworthy. We can also create mock-ups, suggest possible outcomes, and see how they react. For instance, ask "How would you feel if the flight information was written as one long string?"
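As a rough illustration, the weightings in Table 1 could be held as data and summed into an overall score, which is then mapped to good, bad, or unknown. This is only a sketch: the factor names, the dictionary layout, and the margin around zero that counts as "unknown" are assumptions made for the example, not part of the approach described above.

```python
# Weightings copied from Table 1; the factor names and layout are illustrative.
WEIGHTS = {
    "flights_heading": {True: 1, False: 0},
    "booknow_form": {True: 2, False: -1},
    "flight_data": {True: 5, False: 0},
    "http_error": {True: -10, False: 0},
}

# Hypothetical margin: totals within +/-1 of zero are treated as "unknown".
UNKNOWN_MARGIN = 1


def total_score(observed):
    """Sum the score of every factor, given a dict of factor name -> bool."""
    return sum(
        WEIGHTS[name][bool(observed.get(name, False))] for name in WEIGHTS
    )


def classify(score, margin=UNKNOWN_MARGIN):
    """Map an overall score to 'good', 'bad', or 'unknown'."""
    if score > margin:
        return "good"
    if score < -margin:
        return "bad"
    return "unknown"
```

Keeping the weightings in one table-like structure makes them easy to review with the people who helped choose them, and easy to adjust when the tests are calibrated.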

When constructing tests, try to find out whether you can "peek behind the curtain" to access the underlying data. If so, the data may help strengthen your tests by providing a stronger correlation between the results received and the results expected.

As always, we may choose to ignore some indicators, such as the time and date on the page. If so, record what you've ignored for future reference. This will help you and others to differentiate between these factors and whatever else is provided in the results.

Try to find ways to insert guards (or assertions, as they’re known in programming terminology) that validate your tests and help ensure that your tests detect major changes to the system. For example, if the airline ticket Web site's prices are currently limited to US dollars, put in a guard that will fail the tests if other currencies are detected. Deliberately aborting the tests is often preferable to blind execution, which might report erroneous results.
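A guard of this kind might look like the following sketch. The exception name, the function name, and the list of currency markers are all hypothetical; a real guard would use whatever markers the site under test could plausibly emit.

```python
import re


class GuardFailure(Exception):
    """Raised when a basic assumption behind the tests no longer holds."""


# Hypothetical set of markers for currencies other than US dollars.
UNEXPECTED_CURRENCY = re.compile(r"[€£¥]|\b(?:EUR|GBP|JPY)\b")


def guard_usd_only(page_text):
    """Abort the run if the page shows a currency the tests were not built for."""
    match = UNEXPECTED_CURRENCY.search(page_text)
    if match:
        raise GuardFailure(
            f"unexpected currency marker {match.group(0)!r}; aborting run"
        )
```

Raising an exception, rather than subtracting from the score, stops the run before any misleading results are reported.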

I recommend working with others to review the results of your initial tests to decide if your tests are trustworthy and to calibrate the weightings if need be. We need to focus on improving our tests. Look for both false positives (tests that claim to pass when they should fail) and false negatives (tests that claim to fail when they should pass). Whenever you find false positives or false negatives, review your test scripts and the weightings to find ways to make the tests more reliable and robust. Even after making improvements, I still expect my tests to miss some problems. After all, they rely on imperfect heuristics; however, that's OK as long as they find the important problems!
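One simple way to support such a review is to tally where the automated verdict and the reviewer's verdict disagree. The sketch below assumes each reviewed result is recorded as a pair of booleans; the function name and the record format are invented for the example.

```python
def calibration_report(results):
    """Count disagreements between automated verdicts and reviewer verdicts.

    results: iterable of (test_passed, reviewer_says_ok) pairs.
    """
    results = list(results)
    # False positive: the test passed, but the reviewer says it should fail.
    false_positives = sum(1 for passed, ok in results if passed and not ok)
    # False negative: the test failed, but the reviewer says it should pass.
    false_negatives = sum(1 for passed, ok in results if not passed and ok)
    return {
        "reviewed": len(results),
        "false_positives": false_positives,
        "false_negatives": false_negatives,
    }
```

Tracking these counts across calibration rounds gives a crude indication of whether changes to the weightings are helping.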

Implementation Notes
Pattern matching is a practical way to detect the factors. Simple patterns might involve string matching and field comparisons (for example, to verify the HTTP response code field). More complex patterns might rely on regular expressions or structured queries (such as XPath expressions).

Field comparisons are useful when the data is structured and you are able to reliably predict where the relevant field will be in the response. For HTTP, the response code is easy to locate and easy to match. I recommend always adding a test for the value returned.
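A field comparison on the HTTP response code could be as small as this sketch. The score of -10 comes from Table 1; the function name is an assumption, and the status code is assumed to have been extracted already by whatever HTTP client the tests use.

```python
from http import HTTPStatus

ERROR_SCORE = -10  # from Table 1: applies to either 404 or 500


def score_http_status(status_code):
    """Return the Table 1 score for an already-extracted HTTP status code."""
    if status_code in (HTTPStatus.NOT_FOUND, HTTPStatus.INTERNAL_SERVER_ERROR):
        return ERROR_SCORE
    return 0
```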

String matching is generally well supported and easier for non-technical users to work with. However, the string might be found elsewhere within the response, so take care to match the string in the correct part of the response.
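To illustrate the risk, the sketch below restricts a string match to the text between two markers instead of searching the whole response. The helper name and the choice of `<h2>` tags as markers are assumptions for the example.

```python
def contains_in_section(response_text, needle, start_marker, end_marker):
    """Return True only if needle occurs between the first start/end markers."""
    start = response_text.find(start_marker)
    if start == -1:
        return False
    section_start = start + len(start_marker)
    end = response_text.find(end_marker, section_start)
    if end == -1:
        return False
    return needle in response_text[section_start:end]


# "Flights" appears in the page title but not in the section heading.
page = "<title>Cheap Flights</title><body><h2>Hotels</h2></body>"
found_anywhere = "Flights" in page
found_in_heading = contains_in_section(page, "Flights", "<h2>", "</h2>")
```

Here a naive search of the whole page would report the heading as present, while the restricted search would not.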

Regular expressions are very powerful, particularly when matching predictable patterns of data, such as the flight details, within the response. Data suited to regular expressions includes the flight number, price, class (of ticket), source airport code, and destination airport code.
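As an example, a single regular expression could check that one line of flight data is well formatted. The line format used here (flight number, airport pair, ticket class, dollar price) is entirely hypothetical, and the flight number pattern is a simplification; a real pattern would be derived from the actual responses.

```python
import re

# Hypothetical line format: "BA117 LHR-JFK Economy $420.00"
FLIGHT_LINE = re.compile(
    r"^(?P<flight>[A-Z]{2}\d{1,4})\s+"                  # flight number
    r"(?P<source>[A-Z]{3})-(?P<destination>[A-Z]{3})\s+"  # airport codes
    r"(?P<ticket_class>Economy|Business|First)\s+"       # class of ticket
    r"\$(?P<price>\d+\.\d{2})$"                          # price in US dollars
)


def parse_flight(line):
    """Return the matched fields as a dict, or None if the line is malformed."""
    match = FLIGHT_LINE.match(line.strip())
    return match.groupdict() if match else None
```

A line that runs the same details together as one long string would fail to match and so would not count toward the "well-formatted flight data" factor.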

XPath is useful for matching structured data and for homing in on the relevant section of the response to compare.
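The sketch below uses the limited XPath support in Python's standard `xml.etree.ElementTree` module to look for the bookNow form. It assumes the response is well-formed XHTML, which real pages often are not; in that case an HTML-aware parser would be needed first. The sample page and function name are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample response; assumed to be well-formed XHTML.
PAGE = """<html><body>
<h2>Flights</h2>
<form name="bookNow"><input type="submit" value="Book"/></form>
</body></html>"""


def has_booknow_form(xhtml_text):
    """Return True if a form element named bookNow exists anywhere in the page."""
    root = ET.fromstring(xhtml_text)
    # Compare with None explicitly: an element without children is falsy
    # in some Python versions.
    return root.find(".//form[@name='bookNow']") is not None
```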

We can—and generally need to—combine the various pattern-matching techniques to determine whether the factor is present and "correct."

The scoring should be simple to code: If a pattern is matched correctly, return the respective score; otherwise return zero. For factors scored on both presence and absence, such as the bookNow Web form, return +2 if found, otherwise -1. More complex scores can be implemented using case statements, etc. The value of some results will affect whether or not you evaluate other results. For instance, if you detect an "HTTP 500" error you don't need to check the flight data on that page, as the server has already reported an error.
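Put together, the scoring might look like the following sketch, which uses the scores from Table 1 and skips the remaining checks once an HTTP error is seen. The function signature is an assumption; it presumes the individual factors have already been detected by the pattern-matching techniques above.

```python
def score_response(status_code, has_heading, has_booknow, flight_count):
    """Score one response using the weightings from Table 1."""
    # A server error overrides everything else, so stop evaluating here.
    if status_code in (404, 500):
        return -10

    score = 0
    score += 1 if has_heading else 0        # HTML heading: Flights
    score += 2 if has_booknow else -1       # bookNow HTML form
    score += 5 if flight_count >= 1 else 0  # well-formatted flight data
    return score
```

Returning early on the error code means a page that happens to contain stale flight data alongside a server error can never score as acceptable.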

By combining heuristic test oracles with the weighting of responses we can make our automated tests more powerful and reliable, even in volatile environments with complex dynamic data. The concepts are fairly simple, and you can work with experienced users to determine the factors and weightings. I hope you find these techniques useful. Let me know how you get on with these suggestions. Your comments and feedback will help to improve the material further.

Thanks to Mike Kelly and Harry Robinson who have already helped to improve this article.


About the Author
Julian Harty currently works as a software tester at Google on some really cool and innovative projects, which hopefully many of you will find useful and fun. He has over twenty years of experience in computing and is passionate about helping to make software work properly. He is also an international speaker and author.


StickyMinds.com Weekly Column From 2/12/2007 

Member Comments
 
Comment:    
by Ben Simo 3/22/2007

Hi Julian,

Great article.

We define a severity number (1-4) for each of the validations defined for our automated tests. (The validation/oracle definition is entirely separate from the action/application-control definition.) This allows us to configure each test run to only test validations of a specified severity or higher.

We also define a fail state or flag specific validations to identify that if they fail the application's state is not as expected. This then tells the execution engine to restart the application or reset the current state instead of continuing down the current...Read On

 
 
Comment:    
by Tim Fox 2/15/2007

Hi Julian
I liked your article (a friend passed it on), but I am a little confused - are you really improving the 'accuracy' of tests by weighing the results?
I understand result weighting to be an exercise that assists in the interpretation of results (its impact/importance), rather than in the determination of results (PASS/FAIL/ERROR). As such, a testing exercise that is required to validate a system that has complex/dynamic results (such as your airline example) is still expected to present a result. It is at this point that you appear to be applying a weighting to the result that provides the test result with relative...Read On

 
 
Comment:    
by stephen kay 2/15/2007

Hi Julian,

Thanks for a well thought through article. It's well written, easy to understand and certainly has me thinking about how we could use these ideas.

At the moment our automated testing is very black and white. What you list as Indicators, we would see as checkpoints. We have no weighting, if it's there we pass, if it's not we fail. All the failures are investigated by real people who automatically apply heuristics to decide if it really is a failure and our checkpoints are updated accordingly.

I like the idea of using weighting to make our tests more accurate, but can't quite make the leap in my...Read On

Author's Response:
2/15/2007    
Steve,
I have used this approach for one project so far. The weighting was helpful for dealing with a couple of unpredictable factors in helping to assess whether the responses to requests were acceptable, or not... So the work is still experimental rather than well-proven/well established.

One reason I chose to write about the work was to find out, from others, whether these ideas would prove useful more broadly than in the spheres I currently work in.

Glad to see you online :)

 
 
Comment:    
by Sidney Snook 2/13/2007

Perhaps more specifics could be provided on how to decide on what the following are or what criteria do I use to determine these: "Indicators", "Weighting" and "Range"?

Author's Response:
2/13/2007    
Sidney,
Thanks for giving me the chance to clarify a few points:

I consider indicators and factors as virtual synonyms, the first less technical / less formal than the second. I might start by asking for indicators of whether a response is good or bad. As I add detail and clarity I would tend to change to using the term 'factor' to describe the indicator more precisely.

Similarly I consider weighting and range as two similar terms: weighting is the term I tend to use when working with non-technical users, it's less precise and is used to categorize. Range is used when it comes to implement (or codify) the heuristics.

The objective is to end up with an algorithm that will generate a larger score for responses we have confidence in (a large +ve number for good responses, a large -ve number for unacceptable responses). Scores close to zero tend to end up in a middle, grey zone. These are ones we are uncertain of. For these we should review the response and see if we think they deserve to be in either the positive or negative results. If so, perhaps we need to tweak the pattern matching or the scoring to generate an appropriately large score.

I'm aware this is hard for me to explain in pure text. Perhaps I'll be able to add a diagram to the article at some point. However I'm currently abroad and without such facilities. Please tell me whether I have answered your questions sufficiently.

Thanks

Julian

 
 
Comment:    
by Julian Harty 2/12/2007

Glad you liked the article :) If you get the opportunity to try out the concepts on one of your projects, please let me know how well it works for you. I'm particularly interested to find out how you end up scoring the results.

 
 
Comment:    
by Gerard Miller 2/12/2007

Hi Julian,

Nice article! I've never worked on this type of problem but I found the article interesting. The lessons of weighing the results are applicable elsewhere. For example a system with only one failed test sounds good until we find out that the problem is the application doesn’t load. :-} (Of course it’s also helpful to keep Pass/Fail/Untested statistics not just Pass/Fail.)

The vocabulary you used was understandable to this tester with one exception. I had to look on the web for a definition of XPATH.

Drawing an analogy between restaurant and airline ticket scenario is...Read On

 
© 2010 StickyMinds.com. All rights reserved.
StickyMinds.com is a division of Software Quality Engineering.