RAG Debugging Best Practices

The 5 Most Common RAG Failures (And How to Catch Them)

By Criterion Team

After analyzing thousands of RAG interactions across different industries, we've identified five failure patterns that account for 90% of production issues. Here's what to watch out for—and how to catch these problems before your users do.

1. The Context Contradiction

What it looks like: The AI generates an answer that directly contradicts the information in the retrieved context.

Real example:

  • Context: "Our premium plan includes unlimited API calls"
  • User question: "What are the API limits for premium?"
  • Bad answer: "Premium plan includes 10,000 API calls per month"

How to catch it: Implement faithfulness scoring that compares each generated answer against the source context it was built from. Any answer that contradicts a factual statement in that context should be flagged automatically.
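
A minimal sketch of such a check, assuming an LLM-as-judge approach; call_llm is a placeholder for whatever model client you already use, not a real library call:

  # Faithfulness check sketch: ask a judge model whether the generated
  # answer contradicts the retrieved context. `call_llm(prompt) -> str`
  # is a placeholder for your own client (OpenAI, Anthropic, local model).

  FAITHFULNESS_PROMPT = """You are grading a RAG answer.
  Context:
  {context}

  Answer:
  {answer}

  Does the answer contradict any statement in the context?
  Reply with exactly one word: SUPPORTED or CONTRADICTED."""

  def check_faithfulness(context: str, answer: str, call_llm) -> bool:
      """Return True if the answer is consistent with the context."""
      verdict = call_llm(FAITHFULNESS_PROMPT.format(context=context, answer=answer))
      return verdict.strip().upper().startswith("SUPPORTED")

  # The premium-plan example above should come back False and be flagged:
  # check_faithfulness("Our premium plan includes unlimited API calls",
  #                    "Premium plan includes 10,000 API calls per month",
  #                    call_llm)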

2. The Relevance Drift

What it looks like: The AI provides accurate information, but it's not relevant to what the user actually asked.

Real example:

  • User question: "How do I cancel my subscription?"
  • Bad answer: "Our subscription service has been trusted by over 10,000 customers since 2020. We offer three tiers of service with different features..."

How to catch it: Use answer relevancy metrics to ensure responses directly address the user's intent. Look for verbose answers that bury the actual answer in unnecessary context.
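
One rough way to produce such a score, assuming you have an embedding function available; embed is a placeholder, and cosine similarity between question and answer is only a proxy for intent alignment:

  # Relevancy proxy: cosine similarity between embeddings of the question
  # and the answer. `embed(text) -> list[float]` is a placeholder for your
  # embedding model; low scores suggest the answer drifted off-topic.
  from math import sqrt

  def cosine(a: list[float], b: list[float]) -> float:
      dot = sum(x * y for x, y in zip(a, b))
      norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
      return dot / norm if norm else 0.0

  def relevancy_score(question: str, answer: str, embed) -> float:
      return cosine(embed(question), embed(answer))

In practice you might prefer an LLM-as-judge or a purpose-built metric; the point is to produce one number per response that can be logged and thresholded.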

3. The Retrieval Miss

What it looks like: Your retrieval system pulls irrelevant documents, leading to answers based on wrong information.

Real example:

  • User question: "What's your refund policy?"
  • Retrieved context: Blog post about company culture
  • Bad answer: "We believe in putting our employees first and creating a positive work environment..."

How to catch it: Measure context precision by evaluating whether retrieved documents actually contain information relevant to the query. Poor retrieval is often the root cause of downstream failures.
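
A minimal sketch of that measurement; is_relevant is a placeholder judge (an LLM call, a cross-encoder, or labeled data):

  # Context precision sketch: the fraction of retrieved chunks that are
  # actually relevant to the query. `is_relevant(query, chunk) -> bool`
  # is a placeholder for whatever relevance judge you use.

  def context_precision(query: str, chunks: list[str], is_relevant) -> float:
      if not chunks:
          return 0.0
      hits = sum(1 for chunk in chunks if is_relevant(query, chunk))
      return hits / len(chunks)

In the refund-policy example above, a culture blog post should score near zero, pointing you at the retriever rather than the generator.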

4. The Confidence Fake-Out

What it looks like: The AI sounds confident but provides completely fabricated information.

Real example:

  • User question: "Who is your head of engineering?"
  • No relevant context found
  • Bad answer: "Our head of engineering is Sarah Johnson, who joined us in 2022 with extensive experience in distributed systems."

How to catch it: Implement uncertainty detection. When retrieval quality is low or no relevant context is found, the system should acknowledge limitations rather than fabricating answers.
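
One simple guard, assuming your retriever returns similarity scores; the 0.35 cutoff and the retrieve/generate callables are illustrative, not fixed recommendations:

  # Uncertainty guard sketch: refuse to answer when retrieval confidence is
  # low instead of letting the model improvise. `retrieve(q)` is assumed to
  # return (chunk, score) pairs; `generate(q, context)` calls your LLM.

  FALLBACK = "I don't have enough information to answer that reliably."

  def answer_with_guard(question: str, retrieve, generate,
                        min_score: float = 0.35) -> str:
      results = retrieve(question)
      if not results or max(score for _, score in results) < min_score:
          return FALLBACK
      context = "\n\n".join(chunk for chunk, _ in results)
      return generate(question, context)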

5. The Update Regression

What it looks like: System changes break previously working queries without anyone noticing.

Real example:

  • After a model update, the system stops correctly answering basic questions about pricing
  • Customer support starts getting confused calls about "wrong information on the website"
  • Nobody realizes the AI system is the source of confusion

How to catch it: Maintain a regression test suite with known good query-answer pairs. Run these tests after every system update to catch performance degradations immediately.
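
A sketch of such a suite using pytest; rag_answer, the my_rag_app import, and the golden pairs are placeholders for your own pipeline entry point and known-good answers:

  # Regression suite sketch: golden question/answer pairs that must keep
  # passing after every model, prompt, or index change.
  import pytest

  from my_rag_app import rag_answer  # placeholder for your pipeline's entry point

  GOLDEN_CASES = [
      # (question, required substring); values are illustrative
      ("What are the API limits for premium?", "unlimited"),
      ("What's your refund policy?", "30 days"),
  ]

  @pytest.mark.parametrize("question,must_contain", GOLDEN_CASES)
  def test_known_good_answers(question, must_contain):
      answer = rag_answer(question)
      assert must_contain.lower() in answer.lower()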

Building Your Detection System

Here's a practical approach to catching these failures:

1. Automated Scoring Pipeline

Set up automated evaluation for every response (see the combined sketch after this list):

  • Faithfulness score (context contradiction detection)
  • Relevancy score (answer-question alignment)
  • Context precision (retrieval quality)
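
A sketch of how these three scores can be attached to every response; the entries in scorers are evaluation callables you supply (the earlier sketches, adapted, or your preferred evaluation library):

  # Per-response scoring sketch: one record per answer, ready to log.
  from dataclasses import dataclass

  @dataclass
  class EvalRecord:
      faithfulness: float
      relevancy: float
      context_precision: float

  def score_response(question, answer, chunks, scorers) -> EvalRecord:
      return EvalRecord(
          faithfulness=scorers["faithfulness"](chunks, answer),
          relevancy=scorers["relevancy"](question, answer),
          context_precision=scorers["context_precision"](question, chunks),
      )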

2. Threshold-Based Alerting

Define quality thresholds (checked in the sketch after this list):

  • Faithfulness < 0.7: High risk
  • Relevancy < 0.5: Poor user experience
  • Context precision < 0.3: Retrieval failure
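
A minimal check against those limits; notify is a placeholder for whatever alerting channel you already use (a Slack webhook, a pager, structured logs):

  # Threshold check sketch using the limits above (scores assumed in [0, 1]).
  THRESHOLDS = {
      "faithfulness": 0.7,       # below this: high risk
      "relevancy": 0.5,          # below this: poor user experience
      "context_precision": 0.3,  # below this: retrieval failure
  }

  def check_thresholds(scores: dict[str, float], notify) -> list[str]:
      """Return the metrics that breached their threshold and send one alert."""
      breaches = [name for name, limit in THRESHOLDS.items()
                  if scores.get(name, 0.0) < limit]
      if breaches:
          notify(f"RAG quality alert: {', '.join(breaches)} below threshold")
      return breaches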

3. Continuous Monitoring

Track key metrics over time (see the aggregation sketch after this list):

  • Average scores per day/week
  • Distribution of score ranges
  • Regression test performance
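
A bare-bones aggregation sketch over logged scores; in production this would more likely live in your metrics store or dashboard than in application code:

  # Daily average per metric from logged (date, metric, value) rows.
  from collections import defaultdict
  from statistics import mean

  def daily_averages(logged_scores):
      """logged_scores: iterable of (date_str, metric_name, value) tuples."""
      buckets = defaultdict(list)
      for date, metric, value in logged_scores:
          buckets[(date, metric)].append(value)
      return {key: mean(values) for key, values in buckets.items()}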

4. Human Review Triggers

Automatically flag responses for human review (combined in the sketch after this list) when:

  • Multiple metrics are below threshold
  • High-stakes queries (legal, financial, medical)
  • User feedback indicates problems
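
A sketch that combines those conditions into one routing decision; the high-stakes keyword list is illustrative only:

  # Review-trigger sketch: route a response to a human when two or more
  # metrics breach their thresholds, the query looks high-stakes, or the
  # user flagged the answer. Keyword matching is a crude stand-in for a
  # proper query classifier.

  HIGH_STAKES_TERMS = ("legal", "contract", "refund", "medical", "invoice")

  def needs_human_review(scores: dict[str, float], thresholds: dict[str, float],
                         question: str, user_flagged: bool) -> bool:
      below = sum(1 for name, limit in thresholds.items()
                  if scores.get(name, 0.0) < limit)
      high_stakes = any(term in question.lower() for term in HIGH_STAKES_TERMS)
      return below >= 2 or high_stakes or user_flagged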

The ROI of Early Detection

Catching these failures early pays dividends:

  • Reduced support tickets: Fewer confused users contacting support
  • Higher user satisfaction: Consistent, accurate responses build trust
  • Compliance confidence: Meet regulatory requirements for AI systems
  • Development velocity: Faster iteration with automated quality gates

Prevention vs. Detection

While detection is crucial, prevention is even better:

  1. Quality data curation: Better source documents lead to better answers
  2. Retrieval optimization: Tune your search to find more relevant context
  3. Prompt engineering: Design prompts that encourage faithful, relevant responses
  4. Model selection: Choose models that balance capability with reliability

Getting Started

Start simple:

  1. Pick one metric (we recommend faithfulness)
  2. Evaluate a small sample of responses manually
  3. Build automated scoring for that metric
  4. Set up basic alerting
  5. Expand to other metrics gradually

Remember: perfect evaluation is less important than consistent evaluation. A simple system that runs reliably is better than a complex system that gets ignored.


Ready to implement systematic RAG evaluation? Join our private beta and get access to production-ready evaluation tools that integrate with your existing workflow.