
A Proactive Framework for Testing AI Agent and Application Quality

Summary

Agentic AI's autonomy creates new security vulnerabilities that traditional methods cannot handle. Its independence expands the attack surface, allowing compromised agents to access sensitive data. Threats like prompt injection, which hijacks an AI's logic with malicious inputs, and chain-of-thought exploits that manipulate its reasoning, require proactive, adaptive defenses. Security must be designed into these systems from the ground up, using tools like agent-level firewalls and continuous behavioral monitoring to secure this new class of software.

The integration of AI agents and large language models (LLMs) into the enterprise is a present-day reality. These systems promise to revolutionize productivity, but this power introduces a new class of risks. Traditional testing methods are ill-equipped for challenges like model hallucinations, algorithmic bias, data drift, and prompt injection vulnerabilities.

We must shift from reactive, post-deployment bug hunting to a proactive, automation-first methodology that embeds quality throughout the AI lifecycle. This framework provides a practical, platform-agnostic approach to ensure the AI systems we build are not just functional, but also accurate, reliable, secure, and fair. It’s about building tangible trust in AI, enabling these systems to become dependable co-pilots for our organizations.

Universal Principles for Modern AI Testing

These core principles should guide your quality strategy, regardless of the platform or libraries you use.

  1. Test the Components, Then the Conversation
    An AI agent is a system of interconnected parts (data retrieval, tool use, generation). Before testing an end-to-end conversation, isolate and test each component. For a Retrieval-Augmented Generation (RAG) agent, rigorously validate the retriever's accuracy first. A faulty retriever almost guarantees a poor final response, and testing them together obscures the root cause. This modular approach makes debugging faster and more precise.
  2. Prioritize by Impact and Uncertainty
    Testing resources are finite. Focus your efforts where the combination of business risk and technical uncertainty is greatest. High-impact functions (financial, legal, or critical operational guidance) demand the most stringent testing. Similarly, invest more time in uncertain components, like an LLM's open-ended generation, than in predictable, deterministic ones like a standard API call.
  3. Use Metrics for a Baseline, Humans for the Nuance
    Automated metrics provide a consistent, objective baseline of quality, essential for catching regressions in a CI/CD pipeline. However, metrics alone cannot capture the subtleties of brand voice or contextual appropriateness. They can tell you if an answer is factually grounded but not if it's condescending. Reserve dedicated time for manual, exploratory testing by subject matter experts (SMEs) to find the critical issues automated tests will miss.
  4. Employ a Hybrid Evaluation Toolkit
    No single metric covers every dimension of AI quality. Blend classic NLP metrics with modern, LLM-based evaluation techniques; together they give a fuller picture of quality, from grammatical correctness to factual accuracy.
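
The first principle, testing the retriever before the conversation, can be sketched as a small isolated check. This is a minimal illustration, not a prescribed implementation: `fake_retriever`, the document IDs, and the labeled queries are all hypothetical stand-ins, and the ID-based precision and recall shown here are the classic information-retrieval definitions rather than the LLM-judged variants offered by evaluation libraries.

```python
def retrieval_precision_recall(retrieved_ids, relevant_ids):
    """Return (precision, recall) for one query's retrieved document IDs."""
    retrieved = set(retrieved_ids)
    if not retrieved or not relevant_ids:
        return 0.0, 0.0
    hits = len(retrieved & relevant_ids)
    return hits / len(retrieved), hits / len(relevant_ids)


def fake_retriever(query, k=3):
    """Hypothetical stand-in for a real retriever; returns up to k document IDs."""
    index = {
        "pto policy": ["hr-001", "hr-007", "fin-002"],
        "expense report": ["fin-002", "fin-005", "hr-001"],
    }
    return index.get(query, [])[:k]


def evaluate_retriever(retriever, labeled_queries, k=3):
    """Score the retriever alone, with no generation step involved."""
    return {
        query: retrieval_precision_recall(retriever(query, k), relevant)
        for query, relevant in labeled_queries.items()
    }


# Hypothetical labels: which documents an SME marked as relevant for each query.
labeled = {
    "pto policy": {"hr-001", "hr-007"},
    "expense report": {"fin-002", "fin-005", "fin-009"},
}
scores = evaluate_retriever(fake_retriever, labeled)
for query, (precision, recall) in scores.items():
    print(f"{query}: precision={precision:.2f} recall={recall:.2f}")
```

Because the generator never runs, a low recall here points directly at the retriever instead of at the final answer.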

The Hybrid Evaluation Toolkit: Your Testing Arsenal

Selecting the right metric for the right job is a critical skill. This table serves as a practical guide for building your evaluation suite with open-source libraries.
 

| Metric Type | Specific Metrics | Tester's Guide: When to Use | How to Automate (Open-Source Libraries) |
| --- | --- | --- | --- |
| Classic NLP Metrics | BLEU, ROUGE | Best for tasks with predictable, structured outputs like summarization. Measures word overlap with a reference answer. | evaluate, nltk, rouge-score |
| Classic NLP Metrics | BERTScore, METEOR | Ideal when semantic meaning should be similar, but exact wording can differ. Excellent for checking paraphrase quality. | evaluate (includes BERTScore), nltk |
| Modern LLM/Agent Metrics | Faithfulness / Groundedness | ESSENTIAL for any RAG agent. Verifies the agent is not hallucinating and is sticking to the provided source documents. | RAGAs, DeepEval, TruLens |
| Modern LLM/Agent Metrics | Answer Relevancy | Does the answer directly address the user's question? Catches cases where the agent provides a factually correct but useless answer. | RAGAs, DeepEval, TruLens |
| Modern LLM/Agent Metrics | Context Recall & Precision | Crucial for debugging RAG systems. Recall checks whether all the relevant documents were retrieved. Precision checks whether the retrieved documents were relevant. | RAGAs, DeepEval |
| Modern LLM/Agent Metrics | Tool Utilization | For agents that use tools (APIs, calculators). Evaluates if the agent correctly chose and used a tool with the right parameters. | Custom logging and assertions |
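
For the Tool Utilization row, "custom logging and assertions" could look roughly like the sketch below. It assumes your agent framework lets you capture each tool call as a name plus an argument dictionary; the `ToolCall` class, the `currency_converter` tool, and its arguments are hypothetical and used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One logged tool invocation: the tool's name and the arguments passed."""
    name: str
    args: dict = field(default_factory=dict)


def check_tool_use(actual, expected):
    """Return a list of problems; an empty list means the trace passed.

    Simplification: if a tool is called more than once, only its last call is checked.
    """
    problems = []
    actual_by_name = {call.name: call for call in actual}
    for exp in expected:
        call = actual_by_name.get(exp.name)
        if call is None:
            problems.append(f"missing call to '{exp.name}'")
            continue
        for key, value in exp.args.items():
            if call.args.get(key) != value:
                problems.append(
                    f"'{exp.name}' arg '{key}': expected {value!r}, got {call.args.get(key)!r}"
                )
    expected_names = {exp.name for exp in expected}
    for call in actual:
        if call.name not in expected_names:
            problems.append(f"unexpected call to '{call.name}'")
    return problems


# Hypothetical trace: the agent converted to EUR when the test case expected GBP.
trace = [ToolCall("currency_converter", {"amount": 100, "from": "USD", "to": "EUR"})]
expected = [ToolCall("currency_converter", {"amount": 100, "from": "USD", "to": "GBP"})]
problems = check_tool_use(trace, expected)
print(problems)
```

Returning a list of problems rather than a single boolean keeps the failure message useful when it surfaces in a CI log.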


The Automation Engine: CI/CD for AI Quality

Integrating AI-specific testing into your CI/CD pipeline is what lets you ship prompt, model, and tool changes quickly without shipping regressions. A typical flow looks like this:

  1. Commit: A developer pushes a change (prompt, model config, tool code) to Git.
  2. Trigger: The commit automatically triggers a CI/CD pipeline (e.g., in Cloud Build).
  3. Build & Unit Test: The application is built, and fast, isolated component tests run.
  4. Deploy to Test Environment: The new agent is deployed to a dedicated testing environment.
  5. Run AI Quality Suite: A testing suite (e.g., using pytest) executes programmatic evaluation against your "Golden Dataset."
  6. Quality Gate: The pipeline checks results against predefined thresholds (e.g., faithfulness_score >= 0.8).
  7. Report & Deploy: A quality report is published. If the gate passes, the build is promoted for deployment.
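
The quality gate in step 6 can be sketched as a small function whose result becomes the pipeline step's exit code. This is an illustrative sketch: the 0.8 faithfulness threshold comes from the example above, while the `answer_relevancy_score` entry and the shape of the metrics report are assumptions.

```python
# Minimum acceptable scores. The faithfulness value mirrors the example in the text;
# the answer relevancy entry is a hypothetical addition.
THRESHOLDS = {"faithfulness_score": 0.8, "answer_relevancy_score": 0.8}


def gate_failures(metrics, thresholds):
    """Return one message per metric that is missing or below its threshold."""
    failures = []
    for name, minimum in thresholds.items():
        score = metrics.get(name)
        if score is None:
            failures.append(f"{name}: missing from report")
        elif not score >= minimum:  # written this way so NaN scores also fail the gate
            failures.append(f"{name}: {score:.2f} is below {minimum:.2f}")
    return failures


def gate_exit_code(metrics, thresholds=THRESHOLDS):
    """Print any failures and return 0 (promote the build) or 1 (block it)."""
    failures = gate_failures(metrics, thresholds)
    for line in failures:
        print(f"QUALITY GATE FAILED - {line}")
    return 1 if failures else 0


# Hypothetical aggregated scores from the AI quality suite.
example_report = {"faithfulness_score": 0.91, "answer_relevancy_score": 0.74}
exit_code = gate_exit_code(example_report)
# In a pipeline step you would finish with sys.exit(exit_code) so a failure stops promotion.
```

Treating a missing or NaN metric as a failure is a deliberate choice here: a gate that silently passes when the evaluator itself broke gives false confidence.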

Golden Dataset Management: Your Source of Truth

The "Golden Dataset" is a mission-critical asset for AI testing, collaboratively curated by SMEs, Product Owners, and QA.

  • Version Control: Store it as a .json or .yaml file alongside the application code in your repository.
  • Richness: The dataset must be rich, including a test_case_id, golden_answer_substrings, required_tools to verify logic, and key_context_phrases to ensure the right information was used.
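
Since the dataset is edited by several roles, a lightweight schema check can catch malformed entries before they reach the test run. The sketch below assumes each case is a dictionary with the fields named above plus a `question`; the expected types and the `hr_policy_search` tool name are illustrative assumptions, not a fixed schema.

```python
# Field names come from the list above; the expected types are an assumption.
REQUIRED_FIELDS = {
    "test_case_id": str,
    "question": str,
    "golden_answer_substrings": list,
    "required_tools": list,
    "key_context_phrases": list,
}


def validate_case(case):
    """Return a list of schema errors for one golden test case."""
    errors = []
    for field_name, expected_type in REQUIRED_FIELDS.items():
        if field_name not in case:
            errors.append(f"missing field '{field_name}'")
        elif not isinstance(case[field_name], expected_type):
            errors.append(
                f"'{field_name}' should be {expected_type.__name__}, "
                f"got {type(case[field_name]).__name__}"
            )
    return errors


def validate_dataset(cases):
    """Map '#<position> <test_case_id>' to its errors; also flags duplicate IDs."""
    problems = {}
    seen_ids = set()
    for position, case in enumerate(cases):
        case_id = str(case.get("test_case_id", "<no id>"))
        errors = validate_case(case)
        if case_id in seen_ids:
            errors.append("duplicate test_case_id")
        seen_ids.add(case_id)
        if errors:
            problems[f"#{position} {case_id}"] = errors
    return problems


sample_cases = [
    {
        "test_case_id": "TC001_policy_start_date",
        "question": "When does the standard PTO policy begin for new hires?",
        "golden_answer_substrings": ["first day of employment"],
        "required_tools": ["hr_policy_search"],  # hypothetical tool name
        "key_context_phrases": ["Paid Time Off"],
    },
    {
        "test_case_id": "TC002_report_submission",
        "question": "Who should I submit my weekly expense report to?",
        "golden_answer_substrings": "direct manager",  # wrong type: should be a list
        "required_tools": [],
        # key_context_phrases is deliberately missing
    },
]
problems = validate_dataset(sample_cases)
print(problems)
```

Running a check like this as an early pipeline step means a typo in the dataset fails fast instead of surfacing later as a confusing evaluation error.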

Implementing the Framework: From Pre-Production to Production

Phase 1: Pre-Production Automated Regression Testing

This is your primary quality gate to catch regressions automatically before code reaches users. The process involves iterating through your Golden Dataset, sending each case to the agent, collecting the response, and running a hybrid evaluation.

Implementation Example: A "Tester Agent" in Python
This example uses a "Tester Agent"—a temporary Gemini-powered agent whose job is to run tests. When you give it a natural language command like "test the support agent," it calls a tool that executes the entire test suite against your "Agent Under Test" and then summarizes the results.

1. Prepare Your golden_dataset.json

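This file is the source of truth for your tests. The minimal example below carries only a question and a ground-truth answer per case; add the richer fields described under Golden Dataset Management (golden_answer_substrings, required_tools, key_context_phrases) as your suite grows.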

[
      {
             "test_case_id": "TC001_policy_start_date",
             "question": "When does the standard PTO policy begin for new hires?",
             "ground_truth_answer": "The standard Paid Time Off policy for new hires begins on their first day of employment."
      },
      {
             "test_case_id": "TC002_report_submission",
             "question": "Who should I submit my weekly expense report to?",
             "ground_truth_answer": "You should submit your weekly expense report to your direct manager for approval."
      }
]

2. Create the Tester Agent Script (tester_agent.py)

import os
import json
import vertexai
from vertexai.generative_models import GenerativeModel, Tool
from google.cloud import aiplatform
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from evaluate import load as load_metric

# --- CONFIGURATION (Set as environment variables) ---
PROJECT_ID = os.getenv("GOOGLE_PROJECT_ID")
LOCATION = os.getenv("GOOGLE_AGENT_LOCATION", "global")
AGENT_UNDER_TEST_ID = os.getenv("AGENT_UNDER_TEST_ID")

# --- QUALITY THRESHOLDS ---
FAITHFULNESS_THRESHOLD = 0.75
RELEVANCY_THRESHOLD = 0.85
BERTSCORE_THRESHOLD = 0.88

# --- INTERNAL HELPER FUNCTIONS ---
def _query_agent_under_test(project_id: str, location: str, agent_id: str, query: str) -> tuple[str, list[str]]:
    """
    Simulates making an API call to the Vertex AI Agent to be tested.
    This is a simplified placeholder. Your actual implementation will use the Vertex AI SDK.
    It returns the agent's text response and a list of retrieved context strings.
    """
    # In a real scenario, you would use the aiplatform.gapic.PredictionServiceClient
    # to call your agent's endpoint and parse the response.
    print(f"  -> Querying agent with: '{query}'")
    # Placeholder response for demonstration
    response_text = "According to the handbook, you must submit reports to your direct manager."
    retrieved_contexts = ["The company handbook states all expense reports must be submitted to a direct manager for approval."]
    return response_text, retrieved_contexts

def _calculate_quality_metrics(question, agent_answer, contexts, ground_truth):
    """Calculates all quality metrics for a single test case."""
    ragas_dataset = Dataset.from_dict({
       "question": [question], "answer": [agent_answer],
       "contexts": [contexts], "ground_truth": [ground_truth]
    })
    ragas_result = evaluate(ragas_dataset, metrics=[faithfulness, answer_relevancy])
    bertscore_metric = load_metric("bertscore")
    bert_result = bertscore_metric.compute(predictions=[agent_answer], references=[ground_truth], lang="en")
  
    return {
       "faithfulness": ragas_result.get('faithfulness', 0.0),
       "answer_relevancy": ragas_result.get('answer_relevancy', 0.0),
       "bertscore_precision": bert_result['precision'][0] if bert_result.get('precision') else 0.0
    }

# --- THE MAIN TOOL FOR OUR TESTER AGENT ---
def run_agent_quality_evaluation() -> str:
    """
    Runs a full quality evaluation suite against a predefined Vertex AI agent using a
    'golden_dataset.json' file and returns a comprehensive JSON report.
    """
    print("🤖 Initializing quality test run...")
    if not all([PROJECT_ID, AGENT_UNDER_TEST_ID]):
       return json.dumps({"error": "Configuration Error: Env vars GOOGLE_PROJECT_ID and AGENT_UNDER_TEST_ID must be set."})

    try:
       with open('golden_dataset.json', 'r') as f:
           dataset = json.load(f)
    except FileNotFoundError:
       return json.dumps({"error": "File not found: `golden_dataset.json` is missing."})

    results = []
    summary = {"total_tests": len(dataset), "passed": 0, "failed": 0}

    for test_case in dataset:
       print(f"  - Testing case: {test_case['test_case_id']}...")
       agent_answer, contexts = _query_agent_under_test(
           PROJECT_ID, LOCATION, AGENT_UNDER_TEST_ID, test_case["question"]
       )
       scores = _calculate_quality_metrics(
           test_case["question"], agent_answer, contexts, test_case["ground_truth_answer"]
       )
       # Thresholds are inclusive, matching the ">=" quality gate described in the pipeline steps.
       is_pass = (
           scores['faithfulness'] >= FAITHFULNESS_THRESHOLD and
           scores['answer_relevancy'] >= RELEVANCY_THRESHOLD and
           scores['bertscore_precision'] >= BERTSCORE_THRESHOLD
       )
       summary["passed" if is_pass else "failed"] += 1
       results.append({
           "id": test_case['test_case_id'],
           "status": "PASS" if is_pass else "FAIL",
           "scores": {k: f"{v:.2f}" for k, v in scores.items()}
       })

    print("✅ Testing complete. Returning structured report.")
    return json.dumps({"summary": summary, "results": results}, indent=2)

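# --- MAIN EXECUTION BLOCK ---
if __name__ == "__main__":
    # Imported here to keep this block self-contained; you may move it to the top of the file.
    from vertexai.generative_models import FunctionDeclaration, Part

    if not PROJECT_ID:
       print("❌ Error: GOOGLE_PROJECT_ID environment variable not set.")
    else:
       vertexai.init(project=PROJECT_ID, location=LOCATION)
       # Declare the tool explicitly; it takes no arguments.
       run_eval_declaration = FunctionDeclaration(
           name="run_agent_quality_evaluation",
           description="Runs the full quality evaluation suite against the agent under test and returns a JSON report.",
           parameters={"type": "object", "properties": {}},
       )
       agent_testing_tool = Tool(function_declarations=[run_eval_declaration])
       model = GenerativeModel(
           "gemini-1.5-pro-001",
           tools=[agent_testing_tool],
           system_instruction="You are a QA assistant. When asked to run a test, call the `run_agent_quality_evaluation` tool. After the tool returns a JSON report, summarize the findings clearly for the user. Start with the overall summary, then list any failing tests with their scores."
       )
       chat = model.start_chat()
       prompt = "Hello! Could you please run the standard quality check on our support agent?"
       print(f"🗣️  You: {prompt}")
       response = chat.send_message(prompt)
       # The model answers with a function call rather than text, so run the tool
       # ourselves and send its JSON report back for summarization.
       if response.candidates[0].function_calls:
           report = run_agent_quality_evaluation()
           response = chat.send_message(
               Part.from_function_response(name="run_agent_quality_evaluation", response={"content": report})
           )
       print("🤖 Tester Agent:")
       print(response.text)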

3. Run the Evaluation

First, set up your environment, then execute the script.

# Your Google Cloud Project ID
export GOOGLE_PROJECT_ID="your-gcp-project-id"
# The Agent ID of the agent you want to test
export AGENT_UNDER_TEST_ID="your-agent-builder-agent-id"

# Authenticate with Google Cloud
gcloud auth application-default login

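# Install libraries and run.
# ragas is pinned below 0.2 because the script relies on the 0.1.x column names and dict-style results.
# BERTScore support in `evaluate` depends on the separate bert_score package.
# Note: ragas uses an OpenAI judge model by default, so set OPENAI_API_KEY unless you configure another LLM.
pip install --upgrade "google-cloud-aiplatform[generative_ai]" "ragas>=0.1.0,<0.2" "datasets" "evaluate" "bert_score"
python tester_agent.py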

Phase 2: Production Performance Testing & Monitoring with Google Cloud
To run and monitor a production AI agent on Google Cloud, you combine model selection, deployment, and a continuous feedback loop:
  • Select & Deploy: First, choose a foundation model from the Vertex AI Model Garden, such as a Gemini model (e.g., Gemini 1.5 Pro). Deploy your agent within Vertex AI to expose it as a scalable API endpoint for your production applications.
  • Log & Analyze: Once live, stream all agent interactions from Cloud Logging to BigQuery. This creates a structured dataset of every prompt, response, and tool call, which is essential for detailed analysis.
  • Automate Evaluation: Use Cloud Scheduler and Cloud Functions to periodically run evaluations on the live BigQuery data. This automated process calculates key quality metrics, including:
    • Tool Use Correctness: Tool Name Match, Tool Call Validity, and Tool Call Adherence to verify the agent selects and uses its tools accurately.
    • Response Accuracy: A Ground Check to ensure the response aligns with source data and a Fulfillment Check to confirm the user's request was successfully addressed.
  • Detect Drift Proactively: Implement Vertex AI Model Monitoring to compare live traffic against a baseline. This service automatically detects data drift, alerting you to performance risks before they impact users.
  • Visualize Health: Connect Looker Studio to your BigQuery data to create a real-time dashboard.
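
To make the idea of comparing live traffic against a baseline concrete, here is a hand-rolled drift check using the Population Stability Index (PSI) over prompt lengths. This is only an illustration of the concept: it is not the Vertex AI Model Monitoring API, nor a claim about how that service computes drift, and the sample values, bin edges, and alert threshold are all hypothetical.

```python
import math


def population_stability_index(baseline, live, bin_edges):
    """PSI between two samples over shared bins; 0.0 means identical distributions."""
    if not baseline or not live:
        raise ValueError("both samples must be non-empty")

    def proportions(values):
        counts = [0] * (len(bin_edges) + 1)
        for value in values:
            # Bin index = number of (sorted) edges at or below the value.
            counts[sum(value >= edge for edge in bin_edges)] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(count / len(values), 1e-6) for count in counts]

    base_p = proportions(baseline)
    live_p = proportions(live)
    return sum((q - p) * math.log(q / p) for p, q in zip(base_p, live_p))


# Hypothetical prompt lengths (in tokens) from a baseline week and from live traffic.
baseline_lengths = [12, 15, 18, 22, 25, 30, 35, 40]
live_lengths = [45, 50, 60, 38, 70, 80, 55, 42]
edges = [20, 40]  # bins: under 20, 20 to 39, 40 and above

psi = population_stability_index(baseline_lengths, live_lengths, edges)
ALERT_THRESHOLD = 0.25  # hypothetical; tune against your own traffic
if psi > ALERT_THRESHOLD:
    print(f"Drift alert: PSI={psi:.2f}")
```

In a scheduled job, the two samples would come from the logged interactions rather than hard-coded lists, and the same comparison could be applied to other signals such as response length or tool-call frequency.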
About The Author

Sharad is a distinguished technology leader renowned for driving digital transformation and groundbreaking innovation, with a profound specialization in Artificial Intelligence. He is at the forefront of AI research, particularly in the development of advanced AI agent frameworks, leveraging technologies like LLMs, machine learning, and computer vision on platforms such as Google Cloud Platform (GCP). Sharad has a proven history of architecting and deploying end-to-end AI-powered solutions that directly address complex business challenges, significantly boosting project profitability (e.g., through his work on agentic AI bots). His expertise extends to architecting contact center technologies powered by GCP Generative AI, showcasing his ability to bring bleeding-edge AI to practical, high-impact applications. Crucially, his comprehensive background includes leading multiple software testing projects on complex ERP systems (like SAP and Oracle), demonstrating a deep commitment to quality assurance, defect reduction, and efficiency gains (achieving 2-5x improvements). This holistic approach ensures the robust and reliable delivery of cutting-edge AI solutions, making him a highly sought-after advisor and team builder.
