AI-based tools have transformed from a vague, futuristic vision into actual products that are used on a day-to-day basis to make real-life decisions. Still, for most people, the inner workings of deep-learning systems remain a mystery.
If you don’t know what exactly is going on while the input data is fed through layer after layer of a neural network, how are you supposed to test the validity of the output? Are the days of simple tests with a clear and understandable result over?
First of all, let’s make a clear distinction between testing applications that consume AI-based outputs and testing the actual machine learning systems.
If your application falls into the first category, there’s no need to worry—or to change your approach to testing. AI-based third-party tools don’t require any VIP treatment; they can be viewed as black boxes, just like “regular” deterministic third-party products you might use. Focus your effort on testing whether your own products behave correctly when presented with an output from the AI.
But what about the companies that create these machine learning systems? How do you go about verifying that they actually do what they should?
If we’ve learned anything about AI and machine learning over the last decade, it’s that it’s all about data, and lots of it. This data plays a central part in your testing strategy.
The most commonly used approach is to divide the data that’s available to you into three parts: a training set, a development set, and a test set. To understand how to test your AI, you’ll first need to know how these three sets play together to train a neural network.
When you develop a deep-learning system, you feed huge amounts of data into a neural network, with each data point consisting of a clearly defined input and an expected output. Then, you wait for the network to arrive at a set of mathematical formulas that calculate the correct expected output for as many of the data points as possible.
Let’s say you’re in the process of creating an AI-based tool that detects cancerous cells in X-ray images of patients’ lungs. These images, preprocessed to be computer-readable, are your input data, and each image has a defined output, or the expected result. The labeled images you feed into the network for training make up the training set.
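To make the three-way split more concrete, here is a minimal sketch in plain Python. The function name `split_dataset`, the 80/10/10 proportions, and the made-up `(image_id, label)` pairs are assumptions for illustration only; real projects often rely on a library helper and pick proportions that fit the amount of data they have.

```python
import random

def split_dataset(examples, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle the examples and split them into training, dev, and test sets."""
    if dev_fraction < 0 or test_fraction < 0 or dev_fraction + test_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    shuffled = list(examples)              # copy so the caller's list stays untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_dev = int(len(shuffled) * dev_fraction)
    n_test = int(len(shuffled) * test_fraction)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]      # everything else is used for training
    return train, dev, test

# Hypothetical data: (image_id, label) pairs, label 1 = cancerous, 0 = healthy
examples = [(f"xray_{i:04d}", i % 2) for i in range(1000)]
train_set, dev_set, test_set = split_dataset(examples)
print(len(train_set), len(dev_set), len(test_set))
```

The important property is that no example appears in more than one set, so the development and test sets really do contain images the network has never seen during training.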
Trying Out the Algorithm
Once the network has been busy optimizing for some time, you will want to check how well it’s doing with its newly learned formulas. Your training algorithm already outputs how well it’s doing on the training examples, meaning the data you’ve been feeding it all this time. However, using this number to evaluate the algorithm is not a good idea.
Chances are the network will detect cancer correctly in the images it’s seen many times, but that’s no indicator of how it will perform on other images like the ones it will see in production. Your cancer detection algorithm will only get one chance to assess each image it sees, and it needs to predict cancer reliably based on that.
So the real question is, how does the algorithm perform when presented with completely new data that it hasn’t been trained on?
This new data set is called the development set, because you tweak your neural network model based on how well the trained network performs on this set. Simply put, if the network performs well on both the training set and the development set (which consists of images it isn’t optimized for because they were not part of the training set), that’s a good indicator that it will also do well on the images it will face day to day in production.
If it performs worse on the development set, your network model needs some fine-tuning, followed by some more training using the training set and, finally, an evaluation of the new, hopefully improved performance using the development set. Often you will also train several different networks and decide which one to use in your released product based on the models’ performances on the development set.
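Choosing between several trained networks based on their development set performance could look roughly like the sketch below. The candidate "models" here are toy stand-in functions rather than real neural networks, and the names `accuracy` and `select_best_model` are made up for this example.

```python
def accuracy(predict, examples):
    """Fraction of examples for which predict(input) equals the expected label."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(1 for x, expected in examples if predict(x) == expected)
    return correct / len(examples)

def select_best_model(candidates, dev_set):
    """Return (name, dev accuracy) of the candidate that scores highest on the dev set."""
    scores = {name: accuracy(predict, dev_set) for name, predict in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy stand-in for a dev set: the input is a single number, the label is 1 or 0
dev_set = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.55, 1)]

# Toy stand-ins for trained networks (hypothetical names)
candidates = {
    "threshold_0.5": lambda x: int(x > 0.5),
    "threshold_0.7": lambda x: int(x > 0.7),
    "always_healthy": lambda x: 0,
}

best_name, best_score = select_best_model(candidates, dev_set)
print(best_name, best_score)
```

Note that the winner is, by construction, the model that fits this particular development set best.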
Choosing Dev and Test Data Sets
At this point you might ask yourself, isn’t that testing? Well, not really.
Feeding the development set into your neural network can be compared to a developer trying out the new features they’ve built on their machine to see if they seem to work. To thoroughly test a feature, though, a fresh pair of eyes—most commonly a test engineer—is required to avoid biases. Similarly, you’ll want to use a fresh, never-used data set to verify the performance of your machine learning system, as these systems become biased as well.
How does a computer become biased? As described above, during development you tweak your model based on the results it gets on the development set, so by definition you will choose the model that works best with this specific data set. For our cancer detection example, if the development set coincidentally consisted mostly of images showing earlier stages of cancer and healthy patients, the network would have trouble dealing with images showing later stages of cancer, because you selected the model for its performance on a data set in which such images barely appeared.
Of course you should try to use well-balanced training and development sets, but you won’t really know if you managed to do that without using a completely new data set to test the final algorithm. The network’s performance on the test set is the most reliable indicator of how it will perform out in the real world.
For that reason, it’s important to choose a test set that resembles the data your AI will receive in production as closely as possible. For the cancer detection algorithm, that means choosing a variety of images of different qualities, with different sections of the body, from different patients. These images have to be labeled as correctly as possible as cancerous or not cancerous. Now, for the test, you simply have to let the algorithm assess all the test examples and compare the algorithm’s output to the expected output. If the percentage of correctly assessed images is satisfying, the test is successful.
Those of you who are experienced testers will certainly ask, what does “satisfying” mean in terms of those results? In traditional testing, the answer is usually quite clear: The output should be correct for all test cases. However, this will hardly be possible when it comes to machine learning algorithms, especially for complex problems such as cancer detection. So to come up with a concrete number, the best place to start is to look at how qualified humans perform at that exact task.
For our cancer detection example, you’ll want to assess the performance of trained doctors (or, if you want to aim even higher, of a team of world-renowned experts) and use that as your goal. If your AI detects cancer as well as or better than that, you can consider the test results satisfactory.
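The comparison of the algorithm's output with the expected output, and of the resulting accuracy with a human baseline, can be sketched as follows. The function name `evaluate_on_test_set`, the ten made-up labels, and the baseline value of 0.85 are assumptions for illustration; the real baseline would have to come from measuring how qualified doctors perform on the same task.

```python
def evaluate_on_test_set(predictions, expected, human_baseline):
    """Return (accuracy, passed), where passed means accuracy >= human_baseline."""
    if not expected or len(predictions) != len(expected):
        raise ValueError("predictions and expected must be non-empty and equally long")
    correct = sum(1 for p, e in zip(predictions, expected) if p == e)
    test_accuracy = correct / len(expected)
    return test_accuracy, test_accuracy >= human_baseline

# Hypothetical test set labels: 1 = cancerous, 0 = healthy
expected = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
# Hypothetical output of the trained network on the same ten images
predictions = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]

# Assumed accuracy of trained doctors on this task (made-up number)
HUMAN_BASELINE = 0.85

test_accuracy, passed = evaluate_on_test_set(predictions, expected, HUMAN_BASELINE)
print(f"accuracy={test_accuracy:.2f} passed={passed}")
```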
Risk-based Testing in the World of AI
Up until now, we’ve been talking about the percentage of correctly assessed images as the metric to look at in the test results. In other words, you’d count every wrong assessment the same way, whether a healthy patient was diagnosed as cancerous or an ill patient as healthy. However, these two kinds of errors are not equivalent in the real world.
If the AI decides that a healthy patient has cancer, more tests will be performed and the patient will eventually be sent home if the other tests don’t indicate any problems. Apart from a major health scare, all will be well. If, on the other hand, a patient who indeed has cancer is sent home based on an incorrect assessment, they will lose invaluable time to start their treatment. Their chances of being cured might be much worse when the cancer is finally detected than they would have been had the algorithm assessed their X-ray correctly in the first place.
For that reason, you will need to decide how much weight to place on false positives and on false negatives. Similar to risk-based testing of non-AI tools, the decision on whether to release your product in its current state even though some test cases might fail depends on the risk associated with each failing test case. Sending a healthy patient in for more tests is a low risk; sending a sick patient home is a potentially deadly risk.
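One possible way to express this weighting is to count false positives and false negatives separately and combine them into a single weighted cost. In the sketch below, the function names and the weights (1 for a false positive, 10 for a false negative) are purely illustrative assumptions; the actual weights are a risk decision for your product.

```python
def confusion_counts(predictions, expected):
    """Count true/false positives and negatives for binary labels (1 = cancerous)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for p, e in zip(predictions, expected):
        if p == 1 and e == 1:
            counts["tp"] += 1
        elif p == 1 and e == 0:
            counts["fp"] += 1   # healthy patient flagged as cancerous
        elif p == 0 and e == 0:
            counts["tn"] += 1
        elif p == 0 and e == 1:
            counts["fn"] += 1   # ill patient sent home
        else:
            raise ValueError("labels must be 0 or 1")
    return counts

def weighted_error_cost(counts, fp_weight=1.0, fn_weight=10.0):
    """Average cost per example; the default weights are assumed, not prescribed."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no examples to evaluate")
    return (fp_weight * counts["fp"] + fn_weight * counts["fn"]) / total

# Hypothetical test labels and two hypothetical models with the same accuracy
expected = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
model_a  = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]   # two false positives
model_b  = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # two false negatives

cost_a = weighted_error_cost(confusion_counts(model_a, expected))
cost_b = weighted_error_cost(confusion_counts(model_b, expected))
print(cost_a, cost_b)
```

Both hypothetical models assess eight out of ten images correctly, yet the second one ends up with a much higher cost because its mistakes are the dangerous kind.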
Ruling Out Data Biases
Another important part of testing deep-learning systems is bias testing. Because neural networks base their decisions strictly on the data they are trained on, they run a risk of mimicking biases we see when humans make decisions, since these biases are often reflected in data sets that were collected.
Let’s go back to our cancer detection example. When doctors assess X-ray images, they also know the patient’s history, so they might unconsciously pay more attention to a lifelong smoker’s image than to a young, non-smoking patient’s. As a result, they might be more likely to miss lung cancer in the latter patient’s X-ray.
If you use the doctor’s diagnosis to label the expected results for your data set, this bias will likely be transferred to your algorithm. Even though the network won’t get any additional information about the patient, lungs of smokers and non-smokers certainly have differences, so the network might link the look of a non-smoker’s lung to a negative cancer test result and fail to detect cancer in these images.
To rule out biases in neural networks, you’ll need to carefully analyze the test results—especially the failures—and try to find patterns. For example, you could compare the algorithm’s success rate for smokers’ and non-smokers’ images. If there is a noticeable difference, the algorithm might have become biased during training. If there is any reason to suspect a bias, you’ll need to perform additional exploratory tests with tailored data sets to confirm or disprove your suspicion.
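Comparing success rates between groups can be sketched with a small helper like the one below. The function name `accuracy_by_group`, the group labels, and the records are invented for illustration, and how large a difference counts as "noticeable" is left open here because it depends on your data and requirements.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, prediction, expected); returns accuracy per group."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, prediction, expected in records:
        totals[group] += 1
        if prediction == expected:
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}

# Hypothetical test results: (group, prediction, expected), 1 = cancerous
records = (
    [("smoker", 1, 1)] * 5 + [("smoker", 0, 0)] * 4 + [("smoker", 0, 1)] * 1
    + [("non_smoker", 1, 1)] * 3 + [("non_smoker", 0, 0)] * 4
    + [("non_smoker", 0, 1)] * 3   # missed cancer cases among non-smokers
)

rates = accuracy_by_group(records)
gap = rates["smoker"] - rates["non_smoker"]
print(rates, round(gap, 2))
```

A gap like this would not prove a bias on its own, but it would be a good reason to run the additional exploratory tests with tailored data sets.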
The Right Tools
These complexities might lead you to conclude that you’ll need highly specialized tooling to test your deep-learning system. However, rest assured that most of the hard work has already been done by the AI developers.
Weight calculations, data processing, and result evaluation are already woven into the neural network during the development process, as they are required right from the beginning. Once the neural network is built, you can pass any data set into it and it will output the results, along with its overall accuracy on that data set. All that’s left to do is to swap your development set for your test set and look at your network’s performance. No new tools are required for that.
It’s All Still Testing
Testing AI systems is not that different from testing deterministic tools. While there are big differences in the details, it’s still the same process: Define your requirements, assess the risk associated with failure for each test case, run your tests, and evaluate whether the weighted, aggregated results are at or above a defined level. Then add some exploratory testing into the mix to find bugs in the form of biased results. It’s not magic; it’s just testing.
Nice article. So if I detect my NN has a bias, how can it be corrected to remove the bias without impacting the overall performance? Does this assume NN can evolve over time with correction?
This was a stimulating read. One thing I note, however, is that what is described is still a) black-box and b) manual testing, which is typically the most time-consuming and expensive sort of testing. As we increase our use of NNs we will likely have to find more cost-effective ways of doing this, like training an AI to test an AI ;)
Kerstin, great article. Determining test cases and data sets is a huge challenge when dealing with advanced, relatively less explored technologies such as AI.