The tooling ecosystem for LLM evaluation has matured fast. Ragas, DeepEval, PromptFoo, LangSmith; these are serious frameworks, widely adopted, and genuinely useful. If your engineering team isn’t using at least some of them, they probably should be.
But here’s the problem enterprises keep running into: they assemble a solid set of developer tools, run their evals, pass their benchmarks and then their AI assistant still fails in production. Not because the tools lied. Because each tool covers a specific layer of the problem, and no single one covers the conversation as a customer actually experiences it.
This post is about understanding the full testing stack, what each layer does, why it matters, and where the gaps are.
The testing stack, layer by layer
A mature enterprise testing practice for conversational AI spans five distinct layers. Each one answers a different question. Skipping any of them means carrying risk you can’t see.
Layer 1 - Component and unit testing
Tools: DeepEval, PromptFoo
This is where developers live. Unit tests for LLMs; individual prompts validated against expected outputs, metrics computed at the response level, regression checks run as part of CI/CD pipelines.
DeepEval is the closest thing the LLM world has to pytest. It offers 60+ metrics: faithfulness, answer relevancy, contextual precision and plugs directly into automated build pipelines. PromptFoo takes a complementary angle, focusing on prompt regression and security: does your prompt behave consistently across model versions? Does it hold up against adversarial inputs?
These tools are fast, developer-friendly, and essential for catching regressions early. They operate on single turns, single prompts, single components. That’s their strength and their limit.
What they don’t cover: How the assistant behaves across a full, multi-turn conversation with a real user.
Layer 2 - RAG pipeline evaluation
Tools: Ragas, LlamaIndex Evaluators
If your assistant retrieves context before generating a response, which most enterprise chatbots do; you need to evaluate the retrieval layer separately. Is the right content being retrieved? Is the generated response actually grounded in what was retrieved, or has it wandered?
Ragas was built specifically for this. Its metrics like Context Precision, Context Recall, Faithfulness, Answer Relevancy, evaluate the retrieval-augmented generation pipeline as a system, not just the final output. It’s research-backed and highly targeted.
This layer is critical for knowledge-base chatbots, internal assistant deployments, and any scenario where the assistant is expected to cite or reflect specific approved content.
What it doesn’t cover: Whether the conversation, end to end, feels coherent and trustworthy to the person having it.
Layer 3 - Observability and debugging
Tools: LangSmith, Arize Phoenix
Once your assistant is running (in staging, in a limited rollout, or in production) you need visibility into what’s happening inside it. Which prompts are being triggered? Where is latency accumulating? What does the reasoning chain look like for a specific failure?
LangSmith is the natural choice for teams building on LangChain. It captures full traces, allows dataset versioning, and enables human review of flagged interactions. Arize Phoenix covers similar ground with a stronger emphasis on drift detection and production monitoring.
These tools are excellent for debugging known issues and tracking model behaviour over time. They require backend access, traces, logs, chain internals and are primarily used by engineers.
What they don’t cover: Pre-deployment validation from the user’s perspective, with no access to the backend required.
Layer 4 - Red teaming and adversarial testing
Tools: PromptFoo (red team mode), Garak
A dedicated adversarial pass before go-live. Jailbreak attempts, prompt injection, boundary probing, regulatory stress tests. This layer asks: what happens when someone tries to break it?
PromptFoo’s red team mode generates adversarial test cases automatically. Garak (by NVIDIA) is purpose-built for LLM vulnerability scanning, probing for toxicity, hallucination under pressure, data leakage.
This is non-negotiable for public-facing deployments in regulated industries, and increasingly expected by enterprise procurement and compliance teams.
What it doesn’t cover: The everyday failure modes that aren’t adversarial, context loss, subtle hallucination, poor conversation flow, which erode trust slowly rather than catastrophically.
Layer 5 - End-to-end conversational testing
Tools: Hangar 5
This is the layer that sits furthest from the code and closest to the customer. It asks the question none of the other tools answer: what happens when a real person has a real conversation with your assistant?
Not a single prompt. Not a retrieved chunk. Not a traced API call. A full, multi-turn dialogue; the kind where users change topic halfway through, reference something they said three messages ago, or phrase their question in a way no test case ever anticipated.
Hangar 5 connects to your AI assistant the same way a real user does via chat interface, voice channel, or API endpoint, with no access to prompts, weights, or backend systems. It simulates thousands of realistic conversations across diverse personas, phrasings, and edge cases, then scores every dialogue on three dimensions: Relevance (did it answer what was asked?), Grounding (was it factually accurate?), and User Experience (was it a good interaction?).
Crucially, every conversation is recorded in full, transcript and video replay; so when something fails, you’re not reading an error code. You’re watching it happen.
This layer sits outside the development stack by design. It doesn’t require engineering involvement to run. QA leads and programme managers can configure and execute test cycles independently, which means it fits into the pre-deployment sign-off process rather than the developer workflow.
How the layers work together
These tools don’t compete. They each answer a different question, and a mature enterprise testing practice uses all of them.
| Layer | Question answered | Who runs it | When |
|---|---|---|---|
| Unit / component | Does each prompt behave correctly? | Engineers | During development |
| RAG evaluation | Is retrieval and grounding working? | Engineers / ML team | During development |
| Observability | What’s happening inside the system? | Engineers | Staging & production |
| Red teaming | Can it be broken deliberately? | Security / QA | Pre-deployment |
| E2E conversation | How does it behave with real users? | QA / Programme Managers | Pre-deployment & regression |
The pattern that leads to production failures is almost always the same: strong investment at layers one and two, partial coverage at three and four, and nothing at five. The assistant passes its unit tests. It scores well on RAG metrics. It survives red teaming. And then a customer has a three-turn conversation where context quietly falls apart, and the assistant confidently gives them the wrong answer.
The developer tools couldn’t catch it - they don’t run full conversations. The observability tools didn’t flag it - it wasn’t an anomaly, just a bad interaction. The red team didn’t find it; it wasn’t adversarial, just realistic.
Layer five exists to catch exactly that.
The practical implication for enterprise teams
If you’re building out an evaluation practice from scratch, the instinct is to start with the most developer-accessible tools; DeepEval, Ragas, PromptFoo. That’s the right instinct. Those tools belong in your stack.
But the question to ask before go-live is a different one: have we tested this the way a customer will use it? Not a prompt. A conversation. Thousands of them. Across the full range of things real users say and do.
That’s the evidence that gives QA leads something to sign off on. That’s what gives programme managers something to show leadership. And in regulated industries, that’s the difference between documented assumptions and defensible assurance.
The tools are there. The stack is complete. The only question is whether you’re using all of it.