AI/ML Testing and Your Test Future (Part 2)


AI, AI, AI: it is everywhere. We all read about it in the news, see it in politics and in our web traffic, and now it’s coming to our tools. Interestingly, testing has been identified as one of the most essential areas for AI as well as for the safety of the public. This article offers references to start you on your AI test journey, reviews "classic AI" problem areas, and identifies possible concepts to use when testing AI. As usual, it comes down to a willingness to learn new things or apply historical ideas to advance your test career.

Introduction: AI and the Human Thinkers

It has been a few years since my first AI article in StickyMinds, and AI has now grown into a hot topic for industry, government, the media, and tech geeks. I remain optimistic and think there is an excellent future for testers who up their game to be AI testers. This follow-up article addresses where I think things may be going. It is not an end but the next small step in a long journey, likely over many years of testing AI. There will be changes and lessons learned yet to come.

I have seen information that says, "AI testing following classic concepts is not viable." Jason Arbon warns that following cults or dogma in testing is not a great idea. So, in reverse order to what you may be expecting, I present some references you should consider, starting with how to follow and improve your AI test thinking. Here are the references and the people I follow, since we are all still learning about AI:

The government reference is likely one of many AI "rules" to come. In these moves, testing, including independent testing, made the list of important AI activities. However, these pronouncements are vague, and many items are not defined, such as “What does the government mean by independent testing, and who pays?” We do not know yet, but the other reference demonstrates that the industry is working on AI and AI testing.

And finally, I have words of thought from AI itself (ChatGPT 3.5).

In the article, the portrayal of the government and society can vary depending on their actions and responses to the challenges posed by AI. They are not strictly categorized as inherently good, bad, or neutral characters but as entities that exhibit various behaviors and perspectives.

Some members of the government and society (companies and techies) may be depicted as positive characters who recognize the importance of responsible AI development. They actively engage with AI ethics, collaborate with experts, and support initiatives to create friendly AI systems. These characters advocate for transparency, accountability, and the well-being of individuals in the face of advancing technology.

Conversely, individuals within the government or society may initially underestimate (or downplay) the potential risks of AI or prioritize short-term gains over long-term consequences. They may display skepticism, resistance to change, or fail to grasp the significance of ethical considerations in AI development. These characters represent the challenges and barriers emerging in adopting responsible AI practices.

The narrative can also highlight the collective efforts of government and society to navigate the complexities of AI ethics. It can demonstrate the evolution of their understanding and actions over time as they become more aware of the importance of friendly AI and work towards implementing regulations, guidelines (currently found in ISO and IEEE efforts), and collaborations to ensure AI's responsible development and deployment.

The nuanced portrayal of the government and society aims to reflect the diverse perspectives and responses in real-life contexts. It encourages readers to consider the role of governance and societal engagement in shaping the future of AI. It prompts them to reflect on the importance of collective responsibility in creating (and testing) a positive and ethical AI-powered world.

Even the AIs themselves are still "learning." To continue, Jason Arbon wrote: “To properly assess AI, we should use sampling, statistical methods, and metrics similar to how search engines are tested.” And a final reference to consider is this article on testing AI systems.
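One way to picture the sampling-and-statistics idea is the following minimal sketch. It assumes you can draw a random sample of test cases and have some judge (a human rating or an automated check) that marks each AI output as acceptable or not. The names `sample_pass_rate`, `wilson_interval`, `model`, and `judge` are hypothetical and are not taken from any of the references. The sketch reports a pass rate with a confidence interval rather than a single pass/fail verdict.

```python
import math
import random
from typing import Callable, Sequence, Tuple


def wilson_interval(passes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if n <= 0:
        raise ValueError("need at least one sampled result")
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def sample_pass_rate(
    cases: Sequence,
    model: Callable,
    judge: Callable,
    sample_size: int,
    seed: int = 0,
) -> Tuple[int, int]:
    """Judge the model on a random sample of cases; return (passes, sample size)."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    sample = rng.sample(list(cases), min(sample_size, len(cases)))
    passes = sum(1 for case in sample if judge(case, model(case)))
    return passes, len(sample)


if __name__ == "__main__":
    # Toy stand-ins (assumptions): the "model" doubles a number, except that it
    # is wrong on multiples of 7; the "judge" knows the expected answer.
    cases = list(range(1000))
    model = lambda x: 2 * x if x % 7 else 0
    judge = lambda case, output: output == 2 * case
    passes, n = sample_pass_rate(cases, model, judge, sample_size=200)
    low, high = wilson_interval(passes, n)
    print(f"pass rate {passes / n:.3f}, approx. 95% interval [{low:.3f}, {high:.3f}]")
```

The interval makes the uncertainty of a sampled result visible: a small sample yields a wide interval, which is a signal to sample more before drawing conclusions about the AI.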

My List of Starting Points for AI Testing

You have some homework to do if you do not know or recognize the test ideas listed in this section. This article is not a tutorial on them, as they are detailed in many test books, websites, and standards. Also, this is my current list. It is neither comprehensive nor proven by history, and it should be viewed as only an infant's starting point with all the mistakes an infant makes. The list is interesting because, when I researched it, I found it conflicted with various authors, standards, schools of testing, and my own thinking. I present the list as a compendium without debating whether a listed item is a pure technique, a concept, an approach, a phase, or some other name that standards such as ISO/IEC/IEEE 29119 carefully define. Given the state of AI testing, I would rather see a test idea identified now and worry about its classification once we have long-term AI test results on the current and future generations of AI. Also, I think the items on the list will change as we practice more AI testing.

Given this information, readers should consider the list, use it as a starting point to look up the details of implementing each AI test concept, and then prove me wrong. The list's core is Verification and Validation (V&V) Assessment and Independent V&V (IV&V), a summary of which comes from IEEE 1012, a standard that focuses on V&V processes for systems, software, and hardware that are being developed, maintained, or reused, including AI. V&V processes include dynamic testing, static analysis, demonstration, analysis, evaluation, review, inspection, simulation, assessment, formal methods, model verification checks, and testing of products. (ref: IEEE 1012-2016 - IEEE Standard for System, Software, and Hardware Verification and Validation)

1. Development AI/ML level V&V and IV&V
     a. Sanity Testing: agile quick tests
     b. Inspect and apply static analysis (tools) to logic as it is developed
     c. Risk analysis: part of all testing and test planning
     d. Input data testing: assess the training and test data for completeness, bias, sufficiency, and quality
          i. Data and distribution-based verification using Bayesian Methods
          ii. Calibration of an AI model
     e. Developer structural testing of the non-AI parts of the system
     f. Developer AI system structural-based tests
          i. Test coverage measurement
          - Note: it is not clear what is possible
     g. Scenario testing: Use-cases, specification-requirement checks (when they exist), story testing, technology-facing testing, business case-facing testing, feature assessment
     h. Metamorphic test design
     i. Combinatorial test design
     j. Static analysis of the AI training and test data

2. Software interface level V&V and IV&V
     a. Test interfaces within the AI system when they are identified
     b. Test interfaces between the AI and traditional software systems

3. Software and system level V&V and IV&V: Testing the integration between different AI modules, software support subsystems, and hardware.
     a. Continuous Integration (CI) Testing: Automating tests in the CI pipeline.
     b. Continuous Deployment (CD) Testing: Automating deployment and verification.
     c. Demonstration of the released system with no special test tooling
          i. Chaos Engineering
          ii. Conduct security assessments (see below)
          - Note: the following sub-list generally applies to testing, V&V and IV&V

4. Regression Testing: Ensuring new code and AI model changes do not adversely affect existing functionality.
     a. A/B Testing: Comparing the performance of two or more AI models
     b. Differential Testing: Comparing the outputs of different AI models (versions) on the same input
     c. Standard test suite tests, including a standard set of stress tests

5. Performance and size testing: Evaluating the AI system's speed, responsiveness, and scalability
     a. Stress Testing: Assessing the AI system's performance under normal and extreme conditions
     b. Load Testing: Measuring how well the AI system handles concurrent users or data loads
     c. Scalability Testing: Testing the AI system's performance under increased workload
     d. Temporal Testing: Assess the AI model's ability to handle time-dependent data
     e. Memory Usage Testing: Checking the AI system's memory footprint and size usage
     - E.g. Model Compression assessment: Evaluating model compression techniques

6. Usability testing: Assessing the AI system's user interface and user experience

7. Security assessment: Identifying and fixing vulnerabilities in the AI system and support software
     a. Adversarial Testing: Evaluating the AI system's resistance to adversarial attacks
     b. Privacy Testing: Ensuring the AI system protects user privacy
     c. Fuzz Testing: Injecting random data to test the AI system's resilience
     d. Access/Password Cracking
     e. Assess access to sensitive data
     f. Pen testing
     g. Software attacks to break the software security
     h. Assess for Malware when code reuse is practiced
     i. Spoofing attack
     j. Taint analysis
     k. Denial of service attack
     l. Tampering attack
     m. Social engineering attack

8. Exploratory Testing: Discovering defects and edge cases by exploring the AI system
     a. Attack-based testing of the AI and software system (see books by Hagar and Whittaker)
     b. Experience-based test design
     c. Error guessing
     d. Random-statistical test inputs to the AI model

9. Model validation analysis: Assessing whether the AI model accurately represents the real-world problem.
     a. Simulation analysis

10. Interoperability Testing: Ensuring the AI system can interact with other systems or (various or many) APIs

11. Conformance Testing: Testing whether the AI system adheres to relevant standards and regulations (industry and government)

12. Fault Injection Testing: Introducing faults to test the AI system's robustness

13. Model-based testing (MBT) with automated test generation (millions of tests?)
     a. Keywords and automated test execution engines
     b. AI testing system: An AI system that tests the AI system under test

14. AI product quality assurance (QA) and dependability assessment – See IEEE 982.1 (to be published next year)
     a. Controllability
     b. Functional adaptability
     c. Functional (very high level) correctness
     d. Robustness: Assessing the AI model's stability under different conditions
     e. Transparency
     f. Reliability
     g. Dependability
     h. Safety and hazard ops
     i. Resilience testing: Evaluating the AI system's ability to recover from failures
     j. Data quality analysis: Ensuring the data used for training and testing is high quality with minimized bias
     k. Data augmentation analysis: Checking the effectiveness and quality of the data used in training and testing
     l. Bias quality analysis: Detecting and mitigating (reducing) bias in AI models
     m. Mutation analysis: Checking the AI system's ability to process mutated data inputs (training and testing)
     n. Online and ongoing dynamic AI/ML learning (the system "learns when present")
          i. Backward compatibility testing: Ensuring new model versions are backward compatible as learning takes place
          ii. Online learning testing: Assessing the AI model's ability to learn from streaming input data
          iii. Model quantization testing: Verifying the accuracy of quantized AI models
          iv. V&V over the whole lifecycle by Dev, Ops, and third-party teams (as required)
          v. Uncertainty analysis of risks – assess what things are unknown
          vi. Causal analysis
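To make one item from the list concrete, here is a minimal sketch of metamorphic test design (item 1.h). It assumes a text classifier whose "correct" answer is hard to state in advance, so instead of asserting exact outputs it checks a relation: a change to the input that should not matter must not change the prediction. The `toy_sentiment` function is a hypothetical stand-in for a real model, and `metamorphic_violations` is an illustrative helper, not part of any standard or tool.

```python
from typing import Callable, Iterable, List

POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "awful"}


def toy_sentiment(text: str) -> str:
    """Hypothetical stand-in for the AI model under test."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def metamorphic_violations(
    model: Callable[[str], str],
    inputs: Iterable[str],
    transform: Callable[[str], str],
) -> List[str]:
    """Return inputs whose prediction changes under a transform that should not matter."""
    return [x for x in inputs if model(x) != model(transform(x))]


inputs = [
    "The service was great",
    "The food was bad and the room was awful",
    "It was a hotel",
]

# Relation 1: upper-casing the text should not change the sentiment.
print(metamorphic_violations(toy_sentiment, inputs, str.upper))

# Relation 2: adding an exclamation mark should not change the sentiment.
print(metamorphic_violations(toy_sentiment, inputs, lambda t: t + "!"))
```

In this toy case the second relation exposes a weakness (the stand-in model does not recognize "great!" as "great") without anyone having to specify the expected label for each input, which is the appeal of metamorphic relations when an AI has no simple test oracle.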

Many of these test ideas will change and evolve for AI/ML as the industry matures.
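As another illustration from the list above, differential testing (item 4.b) can be sketched in a few lines, assuming two versions of a model that return numeric scores for the same inputs. The functions `model_v1` and `model_v2` are hypothetical toy stand-ins, and the tolerance value is an arbitrary assumption that a real project would need to justify.

```python
from typing import Callable, Iterable, List, Tuple


def differential_report(
    old_model: Callable[[float], float],
    new_model: Callable[[float], float],
    inputs: Iterable[float],
    tolerance: float = 0.0,
) -> List[Tuple[float, float, float]]:
    """Return (input, old output, new output) wherever the versions differ by more than tolerance."""
    diffs = []
    for x in inputs:
        old, new = old_model(x), new_model(x)
        if abs(old - new) > tolerance:
            diffs.append((x, old, new))
    return diffs


def model_v1(x: float) -> float:
    """Hypothetical earlier model version."""
    return 0.5 * x


def model_v2(x: float) -> float:
    """Hypothetical retrained version that behaves differently for larger inputs."""
    return 0.5 * x if x < 8 else 0.5 * x + 1.0


for x, old, new in differential_report(model_v1, model_v2, range(10), tolerance=0.1):
    print(f"input={x}: v1={old} v2={new}")
```

A difference is not automatically a defect; the report only tells the tester where to look and which changes in behavior need a human judgment.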

This is just a list of food for thought, but if someone tells you they have all of the answers on AI testing, they are probably just selling something. Those of us working on this list are still discussing, learning, and defining. The list and definitions are incomplete and, at best, in a state of evolution.

If a V&V/test concept is unfamiliar to you, you will have to do some work to find information and read enough to get an introduction.

AI Test Career Growth and Learning Advice

If you know some or many of these test concepts, you may be able to start AI testing. Please, do it now.

Between AI and security testing, there are opportunities and jobs to be had. As the first article implied, it may be time to retire if you do nothing but remain a manual, follow-the-script tester. Here is what I do almost daily.

  • Learn—I have been in software testing for over 40 years, and I still have much more to learn.
  • Train and practice—Find and practice testing AI.
  • Support exploratory and attack testing - be context-driven, think, observe, test, repeat, and get out there and do it.
  • Support standards—might conflict if you have a context-driven viewpoint, but you should move outside of the dogma that standards are "bad."
  • Be a positive optimist—but run scared of AI going to the “dark side.”

If you want to grow, consider joining some of the groups on the list. It may take being in the right company or joining the proper organization, but the industry needs help, and besides, we cannot leave everything to the government.

Some Things to Worry About in Your AI Testing

Here are some of the pitfalls that we all should be aware of during AI engineering and testing. This is not a complete list, but it gets you started.

  • Alignment problem—do the goals of the AI/ML conflict with the human users and operators?
    Note: Misalignment yields more danger and risks in AI.
  • Benefit vs Risk: The AI/ML benefits need to be greater than the risk that a "bad" AI/ML might pose. It is our job to reduce risks and make AI a benefit to humankind.
  • Bias—(See future standard IEEE 7003). All data and AI/ML are biased, especially when dealing with minority or small training/test sets. Testers need to be ethicists, data tacticians and mathematicians with the human touch so that AI can become successful and minimize bias.
  • Many users do not understand the risks and possibilities of AI models. Testers need to be skeptical FOR our users.
  • Automate, Automate, Test, and while doing these, you still need to think.
  • Testers need to mirror social considerations—New technology leaps will change society but should not replace humans, because we all want to avoid an AI doomsday.
  • Finally, know about tabula rasa in the AI training starting point.

Note: tabula rasa is a Latin phrase roughly translating to "blank slate". It suggests that people are born with no innate ideas, so their minds are shaped by their experiences. Similarly, an AI/ML may start as a blank slate, but the training set and data inputs quickly change this, and as an AI/ML learns, it becomes less blank and may become more biased. In testing, we need to remember this concept.

User Comments

Mike Emeigh

We have to get out of our heads - and help our community get out of its collective heads - the Given/When/Then/Assert model of testing that (sadly) all too many places have accepted as the way that testing "should" be done. We have to recognize that AI models by their nature are non-deterministic and that we won't generally be able to assert that "any" outcome produced by an AI model is as it should be.

Testing AI should be treated as a voyage of discovery. We set out to discover what we can about the model and the underlying data patterns, charting our course using Jon's starting points, and assess where we are and where we need to go at each stage of the journey, and what dragons may lie ahead. We're rarely going to know whether we're in the "right" place or the "wrong" place, but we can do our best to assess the essential nature of the place we are in.

December 19, 2023 - 5:15pm
Jon Hagar

For me, testing and testing AI is/will be all about creative thinking using a large background of knowledge. As you indicate, right and wrong are subjective and not always absolute. AI will have a problem with this lack of absolutes for the foreseeable future. Humans work at their best this way. Exploratory testing plays on this. Learn as we go and where we are at each stage, ready to find a dragon.

December 20, 2023 - 5:00pm


StickyMinds is a TechWell community.
