Overview

Conversation Simulation helps you validate how your agent behaves before users are impacted. Instead of waiting for production feedback, you can run realistic test conversations, measure quality against clear criteria, and improve safely.

Most agent issues are not technical failures. They are quality failures:

The answer is factually incomplete.
The tone is inconsistent with your persona.
The response misses required points for compliance or support policy.
The answer quality drops after changing AI models.

Simulation addresses these risks by making quality visible and repeatable.

Key Benefits

Benefit	What it gives you
Pre-release confidence	Validate behavior before publishing changes
Repeatable quality checks	Re-run the same scenarios over time
Faster iteration	Identify weak responses quickly, then improve the prompt, knowledge, or settings
Model migration safety	Compare quality before and after model changes
Clear evidence for teams	Use run results as objective quality proof

Simulation and Model Switching

Changing a model can improve speed or cost, but it can also change response style, reasoning depth, and consistency. A model switch is safe only if conversation quality holds up for your real use cases.

Use Simulation as your regression guardrail:

Run your core scenarios on the current model and store results.
Switch the model in the draft configuration.
Re-run the same scenarios.
Compare how well the criteria are matched and how response quality patterns differ between the two runs.
Publish only when critical scenarios remain strong.

This process helps you keep conversations reliable even when the underlying model changes.
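The comparison step above can be sketched in a few lines of plain Python. This is an illustration only, not the product's API: the `ScenarioResult` structure, the `critical` flag, the `tolerance` parameter, and the sample data are all assumptions made for the example. It assumes you have recorded, for each scenario in each run, how many criteria were met.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioResult:
    """Outcome of one scenario in one run (hypothetical structure)."""
    scenario: str
    criteria_met: int
    criteria_total: int
    critical: bool = False

    @property
    def match_rate(self) -> float:
        # Share of criteria met; 0.0 if the scenario has no criteria.
        return self.criteria_met / self.criteria_total if self.criteria_total else 0.0


def find_regressions(baseline: list[ScenarioResult],
                     candidate: list[ScenarioResult],
                     tolerance: float = 0.0) -> list[str]:
    """Return scenarios whose match rate dropped by more than `tolerance`."""
    base_by_name = {r.scenario: r for r in baseline}
    regressions = []
    for result in candidate:
        base = base_by_name.get(result.scenario)
        if base is None:
            continue  # no baseline for this scenario, nothing to compare
        if base.match_rate - result.match_rate > tolerance:
            regressions.append(result.scenario)
    return regressions


def safe_to_publish(baseline: list[ScenarioResult],
                    candidate: list[ScenarioResult],
                    tolerance: float = 0.0) -> bool:
    """True when no scenario marked critical in the baseline regressed."""
    critical = {r.scenario for r in baseline if r.critical}
    regressed = find_regressions(baseline, candidate, tolerance)
    return not any(name in critical for name in regressed)


# Made-up sample data: run on the current model, then on the new model.
baseline_run = [
    ScenarioResult("refund request", 4, 4, critical=True),
    ScenarioResult("shipping delay", 3, 4),
]
candidate_run = [
    ScenarioResult("refund request", 4, 4, critical=True),
    ScenarioResult("shipping delay", 2, 4),
]

print(find_regressions(baseline_run, candidate_run))  # ['shipping delay']
print(safe_to_publish(baseline_run, candidate_run))   # True
```

In this sample the non-critical scenario regressed, so it is worth reviewing, but the critical scenario held steady. Which scenarios count as critical, and how much of a drop is acceptable, are decisions for your team.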

Recommended Workflow

Follow this sequence for beginner-friendly, reliable adoption:

Create and organize categories.
Design realistic conversation scenarios.
Run simulations and monitor execution status.
Analyze criteria outcomes and improve your agent.
Repeat after any major update (prompt, resources, or model).

Best Practices

1) Design scenarios from real user intent

Use actual user requests, support tickets, and high-risk questions. Synthetic examples are useful, but real patterns reveal real risk.

2) Keep each conversation focused

A scenario should test one main objective. If it tries to test everything, failures become hard to diagnose.

3) Write measurable criteria

Criteria should be specific and observable. Prefer "must mention refund window" over "should be helpful".
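One way to see the difference: a specific criterion can be turned into a concrete check, while "should be helpful" cannot. The sketch below is not how Simulation evaluates criteria; it is a deliberately simple phrase check in plain Python, and the function name, the criteria names, and the accepted phrases (including the "30 days" window) are hypothetical.

```python
def check_required_points(response: str,
                          required_points: dict[str, list[str]]) -> dict[str, bool]:
    """Map each criterion to whether any of its accepted phrases appears in the response."""
    text = response.lower()
    return {
        name: any(phrase.lower() in text for phrase in phrases)
        for name, phrases in required_points.items()
    }


# Hypothetical criteria: each one names a point the answer must mention.
criteria = {
    "mentions refund window": ["30 days", "30-day"],
    "mentions proof of purchase": ["receipt", "proof of purchase"],
}

good = "You can return the item within 30 days. Please keep your receipt."
vague = "We are happy to help with your return."

print(check_required_points(good, criteria))
print(check_required_points(vague, criteria))
```

The first response satisfies both criteria and the second satisfies neither, even though both could pass as "helpful". A criterion written this specifically leaves little room for disagreement about whether it was met.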

4) Build a baseline before model changes

Always capture a baseline run before changing the model. Without baseline data, quality comparison is subjective.

5) Re-run critical scenarios routinely

Do not test once and forget. Re-run priority scenarios after changes to behavior settings, resources, custom APIs, plugins, or the model.

6) Use results to drive targeted fixes

When a criterion fails, decide where to improve:

Persona and behavior instruction
Dialog settings
Resource quality or coverage
Tool or plugin usage rules

Quick Start Checklist

Use this checklist if you want to start fast:

Create at least 3 categories for your top use cases.
Build 5 to 10 scenarios for high-frequency and high-risk questions.
Add criteria for every conversation step.
Run a baseline simulation.
Apply improvements and run again.
If you are switching models, compare the baseline against the new model before publishing.
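If you keep a plan of your test suite outside the product, the first three checklist items can be verified mechanically. The sketch below assumes a hypothetical plain-Python layout (category name mapped to a list of scenarios, each with `name` and `steps`, each step with a `criteria` list). None of these names come from the product.

```python
def checklist_gaps(categories: dict[str, list[dict]]) -> list[str]:
    """Return human-readable gaps against the quick start checklist."""
    gaps = []
    if len(categories) < 3:
        gaps.append(f"only {len(categories)} categories; the checklist suggests at least 3")

    # Flatten all scenarios across categories.
    scenarios = [s for group in categories.values() for s in group]
    if len(scenarios) < 5:
        gaps.append(f"only {len(scenarios)} scenarios; the checklist suggests 5 to 10")

    # Every conversation step should carry at least one criterion.
    for scenario in scenarios:
        for index, step in enumerate(scenario["steps"], start=1):
            if not step.get("criteria"):
                gaps.append(f"scenario '{scenario['name']}' step {index} has no criteria")
    return gaps


# Made-up plan with three gaps: too few categories, too few scenarios,
# and one step without criteria.
suite = {
    "Billing": [
        {"name": "refund request", "steps": [
            {"user": "Can I get a refund?", "criteria": ["must mention refund window"]},
            {"user": "How long does it take?", "criteria": []},
        ]},
    ],
    "Shipping": [
        {"name": "late delivery", "steps": [
            {"user": "Where is my order?", "criteria": ["must give tracking instructions"]},
        ]},
    ],
}

for gap in checklist_gaps(suite):
    print(gap)
```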
