Testing a Moving Target: How Do We Test Machine Learning Systems?

Most machine learning systems are based on neural networks, or sets of layered algorithms whose variables can be adjusted via a learning process. These types of systems don’t produce an exact result; in fact, sometimes they can produce an obviously incorrect result. So, how do you test them? Peter Varhol tells you what you should consider when evaluating machine learning systems.

Testing systems that don’t always return the same answer requires new definitions and approaches.

Software testing is a fairly straightforward activity, in theory. For every input, there is a defined and known output. We enter values, make selections, or navigate an application, then compare the actual result with the expected one. If they match, we nod and move on. If they don’t, we may have a bug.

Granted, sometimes an output is not well-defined, there is some ambiguity, or you get disagreements about whether a particular result represents a bug or something else. But in general, we already know what the output is supposed to be.

But there is a type of software for which the output is no longer well defined: machine learning systems.

Most machine learning systems are based on neural networks, or sets of layered algorithms whose variables can be adjusted via a learning process. The learning process involves using known data inputs to create outputs that are then compared with known results. For example, you may have an application that tries to determine an expected commute time based on the weather. The inputs might be temperature, likelihood of precipitation, and date, while your output is commute time for a set distance.
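The commute-time example can be sketched in code. The simplest case of "variables adjusted via a learning process" is a single linear layer trained by gradient descent; the data below are synthetic stand-ins invented for illustration, not a real commute dataset.

```python
import numpy as np

# Hypothetical inputs: temperature (F), precipitation probability,
# and day of year. Output: commute time in minutes.
rng = np.random.default_rng(0)
X = rng.uniform([20, 0.0, 1], [90, 1.0, 365], size=(200, 3))
# Synthetic "known results": commutes lengthen with rain and cold.
y = 30 + 15 * X[:, 1] + 0.1 * (70 - X[:, 0]) + rng.normal(0, 1, 200)

# Standardize inputs so one learning rate suits all features.
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# One linear layer whose coefficients are adjusted by comparing
# its outputs with the known results.
w = np.zeros(3)
b = 0.0
for _ in range(500):
    err = Xn @ w + b - y
    w -= 0.1 * (Xn.T @ err) / len(y)
    b -= 0.1 * err.mean()

mae = float(np.abs(Xn @ w + b - y).mean())
print(f"mean absolute error: {mae:.2f} minutes")
```

A real system would use more layers and more data, but the shape of the process is the same: feed in known inputs, compare the output with the known answer, and nudge the coefficients.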

When the algorithms reflect the known results with the desired degree of accuracy, the algebraic coefficients are frozen and production code is generated. Today, this makes up much of what we understand as artificial intelligence.

This type of software is becoming increasingly common, as it is used in areas such as e-commerce, public transportation, the automotive industry, finance, and computer networks. It has the potential to make decisions given sufficiently well-defined inputs and goals. To be precise, you need quantitative data: the inputs and expected output must be expressible as numbers that can be evaluated and manipulated in a series of equations. This could be as simple as network latency as an input, with likelihood of purchase as an output.
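The latency example might look like the following once trained. The function and its coefficients are made up for illustration; the point is only that both input and output are plain numbers a set of equations can manipulate.

```python
import math

def purchase_probability(latency_ms, w=-0.001, b=1.0):
    """Hypothetical fitted logistic model: a single quantitative input
    (page latency in ms) mapped to a quantitative output (probability
    of purchase). The coefficients w and b are invented, not fitted
    to real data; probability falls as latency rises."""
    return 1.0 / (1.0 + math.exp(-(w * latency_ms + b)))

fast = purchase_probability(100)
slow = purchase_probability(2000)
```

A fast page (100 ms) yields a noticeably higher purchase probability than a slow one (2,000 ms), which is exactly the kind of quantitative relationship these systems learn.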

In some instances, these applications are characterized as artificial intelligence, in that they seemingly make decisions that were once the purview of a human user or operator.

These types of systems don’t produce an exact result. In fact, sometimes they can produce an obviously incorrect result. But they are extremely useful in a number of situations when data already exist on the relationship between recorded inputs and intended results.

For example, years ago I devised a neural network as part of an electronic wind sensor. It worked through the wind cooling the electronic sensor: the precise decrease in temperature corresponded to specific wind speeds and directions. I built a neural network that had three layers of algebraic equations, each with four or five separate equations in individual nodes, computing in parallel. The nodes would use starting variables, then adjust those values based on a comparison between the algorithmic output and the actual answer.

I then trained it. I had more than five hundred data points regarding known wind speed and direction, and the extent to which the sensor cooled. The network I created passed each input into its equations through the multiple layers and produced an answer. At first, the answer from the network probably wasn’t that close to the known correct answer. But the algorithm was able to adjust itself based on the actual answer. After multiple iterations with the training data, the values gradually homed in on accurate and consistent results.
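A minimal sketch of that kind of network is below: layered equations computing in parallel, with every coefficient adjusted by comparing outputs against known answers. The data are synthetic stand-ins for the ~500 sensor-cooling measurements, and the architecture is a generic small multilayer network, not the original code.

```python
import numpy as np

rng = np.random.default_rng(1)
cooling = rng.uniform(0.0, 1.0, size=(500, 1))  # normalized temperature drop
speed = 2.0 * cooling**2 + 0.5                  # invented "known" wind speed

def tanh_grad(a):
    return 1.0 - a**2

# Two hidden layers of four nodes each, plus one linear output node.
W1, b1 = rng.normal(0, 0.5, (1, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 0.5, (4, 4)), np.zeros(4)
W3, b3 = rng.normal(0, 0.5, (4, 1)), np.zeros(1)

lr = 0.1
for _ in range(3000):
    # Forward pass: each layer of equations feeds the next.
    h1 = np.tanh(cooling @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    out = h2 @ W3 + b3
    # Backward pass: adjust every coefficient toward the known answers.
    d_out = (out - speed) / len(speed)
    d_h2 = (d_out @ W3.T) * tanh_grad(h2)
    d_h1 = (d_h2 @ W2.T) * tanh_grad(h1)
    W3 -= lr * h2.T @ d_out;     b3 -= lr * d_out.sum(axis=0)
    W2 -= lr * h1.T @ d_h2;      b2 -= lr * d_h2.sum(axis=0)
    W1 -= lr * cooling.T @ d_h1; b1 -= lr * d_h1.sum(axis=0)

mae = float(np.abs(out - speed).mean())
print(f"mean absolute error after training: {mae:.3f}")
```

After a few thousand passes over the training data, the network's error shrinks substantially, but it never reaches exactly zero, which is the crux of the testing problem.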

How do you test this? You already know what the answer is supposed to be, because you built the network using the test data, but it would be rare to get the exactly correct answer every time.
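One way to phrase "correct" for a system that never returns an exact answer is to accept any prediction within a tolerance, and require that a stated fraction of a held-out data set pass. The 10 percent tolerance and the pass-rate threshold below are placeholders a team would set per application, not fixed rules.

```python
def within_tolerance(predicted, expected, rel_tol=0.10):
    """True if the prediction is within rel_tol of the known answer."""
    return abs(predicted - expected) <= rel_tol * abs(expected)

def accuracy_rate(pairs, rel_tol=0.10):
    """Fraction of (prediction, known answer) pairs within tolerance."""
    hits = sum(within_tolerance(p, e, rel_tol) for p, e in pairs)
    return hits / len(pairs)

# Hypothetical (prediction, known answer) pairs from a held-out set.
results = [(31.0, 30.0), (44.0, 45.0), (29.0, 33.0), (51.0, 50.0)]
rate = accuracy_rate(results)  # 3 of the 4 fall within 10 percent
```

The test then becomes a statistical statement, for example "at least 90 percent of held-out cases within 10 percent of the known answer", rather than an exact-match comparison.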

User Comments

malcolm chambers

I was hoping for some new insight into this difficult area, but the guidelines used are pretty much standard practice for the developers of such systems. All of these guidelines would need to be applied before such a system would even be considered ready to be frozen for production testing.

While I am not currently active in this area, I have been thinking about it and have done some analysis. Testers, I believe, should be looking at things like:

  • Sensitivity analysis: how does the accuracy of the solution depend on variation in each of the input parameters?
  • Alternative models: is there a better solution?
  • Continued evaluation: add new validation data.
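The first suggestion, sensitivity analysis, can be sketched as perturbing each input in turn and measuring how much the model's output moves. The model and numbers below are hypothetical, purely to show the mechanic.

```python
def sensitivity(model, x, delta=0.01):
    """Relative output change per small relative change in each input."""
    base = model(x)
    out = []
    for i in range(len(x)):
        bumped = list(x)
        bumped[i] *= (1 + delta)
        out.append((model(bumped) - base) / (base * delta) if base else 0.0)
    return out

# Toy model: output depends strongly on x[0], weakly on x[1].
model = lambda x: 10 * x[0] + 0.1 * x[1]
s = sensitivity(model, [2.0, 3.0])
```

Inputs with outsized sensitivity values are the ones whose measurement error or drift most threatens the solution's accuracy, so they deserve the most testing attention.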
January 15, 2017 - 3:43pm
Peter Varhol

Point taken. This was my version 1.0 of getting back into machine learning after doing research on it as an academic. I thought I mentioned alternative models, and may be able to get more into that in the future, and I think what you call sensitivity analysis is what I call objectively determining whether a given solution meets requirements.

Still, a lot needs to be done here, and I'm hoping to contribute more in this and other venues this year. I have a presentation on the topic that I'm giving this week at Software Quality Days, and hope to continue developing the topic.

January 16, 2017 - 4:35pm

StickyMinds is a TechWell community.

Through conferences, training, consulting, and online resources, TechWell helps you develop and deliver great software every day.