This is Post 2 of 4 in the GenAI Testing Series. It is written for teams building their first LLM-powered chat or voice bot - without automated testing tools or a dedicated QA team.
The goal is not perfection at launch. It is catching the critical issues that could damage customer trust, break core functionality, or create legal or safety risks. Four tests get you there. The rest can wait.
Test 1: Prompt Unit Testing
Before anything else, verify that your prompts produce expected outputs. The idea is simple: build a dataset of 20–30 input-output pairs that cover your most common use cases and important edge cases, then run each one and check the result.
You do not need special tools to start. A spreadsheet works. Manual human review is acceptable at this stage. The point is to make the test systematic - not dependent on whoever happens to be checking that day.
Success looks like this: a documented set of test cases you can run before any significant prompt or model change. Tools like Promptfoo or Braintrust can automate this later. Right now, getting the dataset built is what matters.
Test 2: RAG and Retrieval Testing
If your bot uses a knowledge base, retrieval quality is the single biggest risk before launch. A well-designed prompt cannot fix bad retrieval. If the wrong document is returned, the answer will be wrong regardless of how good the model is.
Take 20–30 of your most common customer questions and trace each one through the retrieval step. Look at which chunks come back. Check whether the correct documents appear near the top. Flag anything that returns obviously irrelevant content.
Success looks like this: core questions consistently retrieve appropriate content, and you can trace answers back to source material. Tools like Ragas can help score retrieval quality once you want to go further.
Test 3: Hallucination and Factuality Testing
This is the highest-stakes test for customer service bots. LLMs can and do invent information that sounds confident but is simply wrong. In customer service, the consequences are direct: fabricated return policies, incorrect prices, wrong product specifications, made-up deadlines.
Voice bots carry extra risk here. Users tend to trust spoken information more than written text. A wrong answer in chat can be dismissed. The same answer spoken aloud feels authoritative.
Run factual questions with known correct answers across every domain your bot covers: prices, policies, procedures, product details. Check each response manually. Flag anything that cannot be traced to your knowledge base. Look for patterns across failures.
Success looks like this: the bot acknowledges gaps rather than guessing, and every factual claim it makes is verifiable. Hangar5 and Deepeval both offer tooling for this when you are ready to automate.
Test 4: User Acceptance Testing
Internal teams know what the bot is supposed to do. Real users do not. They will phrase things differently, go off script, and attempt interactions your team never anticipated. That gap is what user acceptance testing is designed to surface.
Recruit 5–10 people from your support team or relevant business units. Give them 10–15 structured scenarios to work through, then let them go off script. Collect feedback on correctness, factuality, task completion, tone, response length, and whether escalation to a human agent works as expected.
For voice bots, this step is especially valuable. Written tests miss conversational problems: unnatural pacing, unclear phrasing, poor handling of silence or interruption. Real users surface these quickly.
Success looks like this: documented feedback from real users across defined scenarios, with known issues either resolved or consciously accepted before launch.
What to Defer
Safety and toxicity scanning is worth a quick manual check - try a few adversarial inputs like “ignore your previous instructions” and confirm the bot does not leak its prompt or behave erratically. Most LLM providers include baseline filters, so this does not need to be exhaustive at this stage.
Regression suites, load testing, compliance audits, and A/B experiments can all wait. They become important once you are in production and have a clearer picture of what actually breaks.
The test datasets you build now for prompt unit testing and hallucination checking are the foundation of your future regression suite. Build them deliberately.
Pre-Launch Checklist
Before going live, confirm:
- Prompt unit testing dataset of at least 20 examples, manually reviewed
- RAG retrieval verified for your top 20–30 questions
- Hallucination check covering all factual domains
- User acceptance testing completed with a minimum of 5 real users; structured feedback collected and reviewed
- Basic adversarial check: the bot does not leak its prompt or behave erratically under pressure
What Comes Next
Post 3 covers what becomes critical once you have a live bot. The risks shift once real users are in production, and your testing practice needs to shift with them.
Post 4 is for teams running multiple bots across channels who want a mature, mostly automated testing approach - and an honest look at where even advanced teams still struggle.