Testing a Moving Target: How Do We Test Machine Learning Systems?

Most machine learning systems are based on neural networks, or sets of layered algorithms whose variables can be adjusted via a learning process. These types of systems don’t produce an exact result; in fact, sometimes they can produce an obviously incorrect result. So, how do you test them? Peter Varhol tells you what you should consider when evaluating machine learning systems.

Testing systems that don’t always return the same answer requires new definitions and approaches.

Software testing is a fairly straightforward activity, in theory. For every input, there is a defined and known output. We enter values, make selections, or navigate an application, then compare the actual result with the expected one. If they match, we nod and move on. If they don't, we may have a bug.

Granted, sometimes an output is not well defined, there is some ambiguity, or there is disagreement about whether a particular result represents a bug or something else. But in general, we already know what the output is supposed to be.

But there is a type of software for which a defined output is no longer a given: machine learning systems.

Most machine learning systems are based on neural networks, or sets of layered algorithms whose variables can be adjusted via a learning process. The learning process involves using known data inputs to create outputs that are then compared with known results. For example, you may have an application that tries to determine an expected commute time based on the weather. The inputs might be temperature, likelihood of precipitation, and date, while your output is commute time for a set distance.
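As a rough illustration of that learning process (not the author's actual code), here is a minimal sketch: a single linear "neuron" whose coefficients are adjusted by gradient descent against invented commute-time data. Every value below is hypothetical.

```python
# Hedged sketch: training a single linear "neuron" by gradient descent
# to predict commute time from weather inputs.
# All data values here are invented for illustration.

# (temperature scaled to 0-1, probability of precipitation) -> minutes
training_data = [
    ((0.70, 0.0), 30.0),
    ((0.65, 0.2), 33.0),
    ((0.40, 0.8), 45.0),
    ((0.30, 0.9), 48.0),
    ((0.55, 0.5), 38.0),
]

w_temp, w_precip, bias = 0.0, 0.0, 0.0
lr = 0.05  # learning rate

for _ in range(5000):
    for (temp, precip), actual in training_data:
        predicted = w_temp * temp + w_precip * precip + bias
        error = predicted - actual
        # Adjust each coefficient in proportion to its share of the error
        w_temp -= lr * error * temp
        w_precip -= lr * error * precip
        bias -= lr * error

mae = sum(abs(w_temp * t + w_precip * p + bias - y)
          for (t, p), y in training_data) / len(training_data)
print(f"mean absolute error on training data: {mae:.2f} minutes")
```

A real network stacks many such nodes in layers, but the core loop is the same: predict, compare with the known result, adjust, repeat.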

When the algorithms reflect the known results with the desired degree of accuracy, the algebraic coefficients are frozen and production code is generated. Today, this comprises much of what we understand as artificial intelligence.

This type of software is becoming increasingly common, as it is used in areas such as e-commerce, public transportation, the automotive industry, finance, and computer networks. It has the potential to make decisions given sufficiently well-defined inputs and goals. To be precise, you need quantitative data: the inputs and expected output must be expressible numerically so they can be evaluated and manipulated in a series of equations. This could be as simple as network latency as an input, with likelihood of purchase as an output.

In some instances, these applications are characterized as artificial intelligence, in that they seemingly make decisions that were once the purview of a human user or operator.

These types of systems don’t produce an exact result. In fact, sometimes they can produce an obviously incorrect result. But they are extremely useful in a number of situations when data already exist on the relationship between recorded inputs and intended results.

For example, years ago I devised a neural network as part of an electronic wind sensor. It worked through the wind cooling the electronic sensor, which decreases in temperature in a precise way at specific wind speeds and directions. I built a neural network that had three layers of algebraic equations, each with four or five separate equations in individual nodes, computing in parallel. They would use starting variables, then adjust those values based on a comparison between the algorithmic output and the actual answer.

I then trained it. I had more than five hundred data points regarding known wind speed and direction, and the extent to which the sensor cooled. The network I created passed each input into its equations through the multiple layers and produced an answer. At first, the answer from the network wasn't that close to the known correct answer. But the algorithm was able to adjust itself based on the actual answer. After multiple iterations with the training data, the values gradually homed in on accurate and consistent results.
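The layered structure described above can be sketched roughly as follows. The weights here are invented placeholders, not values learned from the actual sensor data, and the input and output names are assumptions for illustration.

```python
# Hedged sketch of a small layered network: each layer of simple
# equations passes its results forward to the next. Weights are
# invented; a real network would learn them from training data.
import math

def layer(inputs, weights):
    # Each node computes a weighted sum of the previous layer's
    # outputs, squashed through a sigmoid so values stay bounded.
    return [1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(ws, inputs))))
            for ws in weights]

# Two inputs (temperature drop, raw sensor reading), two hidden
# layers of four nodes, one output node (scaled wind speed estimate).
hidden1 = [[0.5, -0.2], [0.1, 0.9], [-0.4, 0.3], [0.7, 0.2]]
hidden2 = [[0.3, 0.1, -0.5, 0.2]] * 4  # four identical nodes, for brevity
output = [[0.6, -0.1, 0.4, 0.2]]

x = [0.8, 0.4]  # one scaled input pair
for weights in (hidden1, hidden2, output):
    x = layer(x, weights)
print(f"estimated wind speed (scaled): {x[0]:.3f}")
```

Following a particular input through those layers by hand is possible here, but with realistic layer sizes it quickly becomes impractical, which is exactly why testing such a system is hard.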

How do you test this? You already know what the answer is supposed to be, because you built the network using the training data, but it will be rare to get a correct answer all the time.

The product is actually tested during the training process. Training either brings convergence to accurate results, or it diverges. The question is how you evaluate the quality of the network. Here are the guidelines I used.

  1. Have objective acceptance criteria. Know the amount of error you and your users are willing to accept.
  2. Test with new data. Once you’ve trained the network and frozen the architecture and variables, use fresh inputs and outputs to verify its accuracy.
  3. Don’t count on all results being accurate. That’s just the nature of the beast. While the algebraic equations aren’t usually complex, there are many of them used in the network, which occasionally produces head-scratching results. You can’t explain it by following the logic, so you have to test and take an occasional bad result with the good. And if it’s not good enough, you may have to recommend throwing out the entire network architecture and starting over.
  4. Understand the architecture of the network as a part of the testing process. Few, if any, testers will be able to actually follow a set of inputs through the network of algorithms, but understanding how the network is constructed will help testers determine if another architecture might produce better results.
  5. Communicate the level of confidence you have in the results to management and users. Machine learning systems offer you the unique opportunity to describe confidence in statistical terms, so use them.
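Guidelines 1 and 2 might look something like this in practice. The model, holdout data, and thresholds below are hypothetical stand-ins, not values from the wind sensor project.

```python
# Hedged sketch: checking a frozen model against an objective
# acceptance criterion (guideline 1) using data held back from
# training (guideline 2). All values are hypothetical stand-ins.
def frozen_model(x):
    # Placeholder for the frozen network's prediction
    return 2.0 * x + 1.0

# Fresh (input, expected output) pairs never seen during training
holdout = [(1.0, 3.1), (2.0, 4.8), (3.0, 7.3), (4.0, 8.2)]

tolerance = 0.5            # maximum acceptable error per case
required_pass_rate = 0.75  # agreed with users in advance

within = sum(abs(frozen_model(x) - y) <= tolerance for x, y in holdout)
pass_rate = within / len(holdout)
accepted = pass_rate >= required_pass_rate
print(f"{within}/{len(holdout)} cases within tolerance; accepted={accepted}")
```

Note that one case misses the tolerance yet the model is still accepted; per guideline 3, the acceptance criterion allows for an occasional bad result.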

One important thing to note is that the training data itself could contain inaccuracies. In the wind sensor example, the recorded wind speed and direction could be off or ambiguous because of measurement error, and the measured cooling of the sensor likely carries some error as well.

Here are some other important considerations:

  • You need test scenarios. Three may well be sufficient: expected best case, average case, and worst case.
  • You will not reach mathematical optimization. We are, after all, working with algorithms that produce approximations, not exact results. Determine what levels of outcomes are acceptable for each scenario.
  • Defects will be reflected in the inability of the model to achieve the goals of the application.
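One way to encode those scenarios, again with invented numbers and a hypothetical stand-in for the frozen model:

```python
# Hedged sketch: three test scenarios (best, average, worst case),
# each with its own acceptable outcome level, per the bullets above.
# All numbers are invented placeholders.
def frozen_model(x):
    # Stand-in for the frozen production model
    return 2.0 * x + 1.0

scenarios = {
    # name: (cases as (input, expected), acceptable mean absolute error)
    "best case":    ([(1.0, 3.1), (1.5, 4.0)], 0.3),
    "average case": ([(2.0, 4.8), (2.5, 6.1)], 0.6),
    "worst case":   ([(4.0, 8.2), (4.5, 9.5)], 1.0),
}

results = {}
for name, (cases, acceptable_mae) in scenarios.items():
    mae = sum(abs(frozen_model(x) - y) for x, y in cases) / len(cases)
    results[name] = mae <= acceptable_mae
    print(f"{name}: mean error {mae:.2f}, acceptable {acceptable_mae}")
```

The worst-case scenario tolerates more error than the best case; a defect, in this framing, is a scenario where the model cannot meet its acceptable level.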

Note that in these types of applications, the acceptance criteria aren't expressed in terms of defect number, type, or severity. In most cases, they are expressed in terms of the statistical likelihood of coming within a certain range. That evaluation of quality and risk isn't a staple of most development and testing projects, so testers may be ill-equipped to consider it.

How can testers provide better feedback on their efforts on such applications? First, evaluate the application according to the acceptance criteria. Second, be prepared to support those assertions in statistical terms; for example, be 95 percent confident that the application will produce an answer within a given range. Last, have a high-level understanding of the underpinnings of the application, so that any deficiencies can be ascribed to a particular application component.
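Expressing results statistically can be as simple as the following sketch, which assumes roughly normal residuals and uses invented holdout errors:

```python
# Hedged sketch: given residuals (prediction minus actual) from a
# holdout run, estimate a 95% interval for the model's error with a
# normal approximation. The residual values are invented.
import statistics

residuals = [0.3, -0.5, 0.1, 0.8, -0.2, 0.4, -0.6, 0.2, 0.0, -0.1,
             0.5, -0.3, 0.2, -0.4, 0.1, 0.6, -0.2, 0.3, -0.1, 0.0]

mean = statistics.mean(residuals)
sd = statistics.stdev(residuals)
# ~95% of errors fall within mean +/- 1.96 standard deviations,
# assuming the residuals are roughly normally distributed.
low, high = mean - 1.96 * sd, mean + 1.96 * sd
print(f"95% of predictions expected within {low:+.2f} to {high:+.2f} of actual")
```

A statement like "95 percent of predictions should fall within this range of the actual value" gives management and users a far more honest picture than a defect count.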

Both testing practices and results have to change to accommodate applications that don’t behave the same as traditional software. If you find yourself working on machine learning and predictive applications, these suggestions may represent a good start in that direction. 

User Comments

malcolm chambers

I was hoping for some new insight into this difficult area, but the guidelines used are pretty much standard practice for the developers of such systems. All of these guidelines would need to be applied before the system would even be considered ready to be frozen and ready for production testing.

While I am not currently active in this area at the moment, I have been thinking about it and have done some analysis. Testers, I believe, should be looking at things like:

  • Sensitivity analysis: How does the accuracy of the solution depend on variation in each of the input parameters?
  • Alternative models: Is there a better solution?
  • Continued evaluation: Add new validation data.
January 15, 2017 - 3:43pm
Peter Varhol

Point taken. This was my version 1.0 of getting back into machine learning after doing research on it as an academic. I thought I mentioned alternative models, and I may be able to get more into that in the future. I think what you call sensitivity analysis is what I call determining objectively whether a given solution meets requirements.

Still, a lot needs to be done here, and I'm hoping to contribute more in this and other venues this year. I have a presentation on the topic that I'm giving this week at Software Quality Days, and I hope to continue developing the topic.

January 16, 2017 - 4:35pm
