RAG Debugging Best Practices

The 5 Most Common RAG Failures (And How to Catch Them)

By Criterion Team

After analyzing thousands of RAG interactions across different industries, we've identified five failure patterns that account for 90% of production issues. Here's what to watch out for—and how to catch these problems before your users do.

1. The Context Contradiction

What it looks like: The AI generates an answer that directly contradicts the information in the retrieved context.

Real example:

  • Context: "Our premium plan includes unlimited API calls"
  • User question: "What are the API limits for premium?"
  • Bad answer: "Premium plan includes 10,000 API calls per month"

How to catch it: Implement faithfulness scoring that compares each generated answer against the source context it was built from. Any answer that contradicts a factual statement in that context should be flagged automatically.
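
A minimal sketch of such a check, assuming an LLM-as-judge approach; call_llm is a placeholder for whatever model client you already use, not a real library call:

  # Faithfulness check sketch: ask a judge model whether the generated
  # answer contradicts the retrieved context. `call_llm(prompt) -> str`
  # is a placeholder for your own client (OpenAI, Anthropic, local model).

  FAITHFULNESS_PROMPT = """You are grading a RAG answer.
  Context:
  {context}

  Answer:
  {answer}

  Does the answer contradict any statement in the context?
  Reply with exactly one word: SUPPORTED or CONTRADICTED."""

  def check_faithfulness(context: str, answer: str, call_llm) -> bool:
      """Return True if the answer is consistent with the context."""
      verdict = call_llm(FAITHFULNESS_PROMPT.format(context=context, answer=answer))
      return verdict.strip().upper().startswith("SUPPORTED")

  # The premium-plan example above should come back False and be flagged:
  # check_faithfulness("Our premium plan includes unlimited API calls",
  #                    "Premium plan includes 10,000 API calls per month",
  #                    call_llm)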

2. The Relevance Drift

What it looks like: The AI provides accurate information, but it's not relevant to what the user actually asked.

Real example:

  • User question: "How do I cancel my subscription?"
  • Bad answer: "Our subscription service has been trusted by over 10,000 customers since 2020. We offer three tiers of service with different features..."

How to catch it: Use answer relevancy metrics to ensure responses directly address the user's intent. Look for verbose answers that bury the actual answer in unnecessary context.
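
One rough way to produce such a score, assuming you have an embedding function available; embed is a placeholder, and cosine similarity between question and answer is only a proxy for intent alignment:

  # Relevancy proxy: cosine similarity between embeddings of the question
  # and the answer. `embed(text) -> list[float]` is a placeholder for your
  # embedding model; low scores suggest the answer drifted off-topic.
  from math import sqrt

  def cosine(a: list[float], b: list[float]) -> float:
      dot = sum(x * y for x, y in zip(a, b))
      norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
      return dot / norm if norm else 0.0

  def relevancy_score(question: str, answer: str, embed) -> float:
      return cosine(embed(question), embed(answer))

In practice you might prefer an LLM-as-judge or a purpose-built metric; the point is to produce one number per response that can be logged and thresholded.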

3. The Retrieval Miss

What it looks like: Your retrieval system pulls irrelevant documents, leading to answers based on wrong information.

Real example:

  • User question: "What's your refund policy?"
  • Retrieved context: Blog post about company culture
  • Bad answer: "We believe in putting our employees first and creating a positive work environment..."

How to catch it: Measure context precision by evaluating whether retrieved documents actually contain information relevant to the query. Poor retrieval is often the root cause of downstream failures.
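
A minimal sketch of that measurement; is_relevant is a placeholder judge (an LLM call, a cross-encoder, or labeled data):

  # Context precision sketch: the fraction of retrieved chunks that are
  # actually relevant to the query. `is_relevant(query, chunk) -> bool`
  # is a placeholder for whatever relevance judge you use.

  def context_precision(query: str, chunks: list[str], is_relevant) -> float:
      if not chunks:
          return 0.0
      hits = sum(1 for chunk in chunks if is_relevant(query, chunk))
      return hits / len(chunks)

In the refund-policy example above, a culture blog post should score near zero, pointing you at the retriever rather than the generator.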

4. The Confidence Fake-Out

What it looks like: The AI sounds confident but provides completely fabricated information.

Real example:

  • User question: "Who is your head of engineering?"
  • No relevant context found
  • Bad answer: "Our head of engineering is Sarah Johnson, who joined us in 2022 with extensive experience in distributed systems."

How to catch it: Implement uncertainty detection. When retrieval quality is low or no relevant context is found, the system should acknowledge limitations rather than fabricating answers.
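
One simple guard, assuming your retriever returns similarity scores; the 0.35 cutoff and the retrieve/generate callables are illustrative, not fixed recommendations:

  # Uncertainty guard sketch: refuse to answer when retrieval confidence is
  # low instead of letting the model improvise. `retrieve(q)` is assumed to
  # return (chunk, score) pairs; `generate(q, context)` calls your LLM.

  FALLBACK = "I don't have enough information to answer that reliably."

  def answer_with_guard(question: str, retrieve, generate,
                        min_score: float = 0.35) -> str:
      results = retrieve(question)
      if not results or max(score for _, score in results) < min_score:
          return FALLBACK
      context = "\n\n".join(chunk for chunk, _ in results)
      return generate(question, context)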

5. The Update Regression

What it looks like: System changes break previously working queries without anyone noticing.

Real example:

  • After a model update, the system stops correctly answering basic questions about pricing
  • Customer support starts getting confused calls about "wrong information on the website"
  • Nobody realizes the AI system is the source of confusion

How to catch it: Maintain a regression test suite with known good query-answer pairs. Run these tests after every system update to catch performance degradations immediately.
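
A sketch of such a suite using pytest; rag_answer, the my_rag_app import, and the golden pairs are placeholders for your own pipeline entry point and known-good answers:

  # Regression suite sketch: golden question/answer pairs that must keep
  # passing after every model, prompt, or index change.
  import pytest

  from my_rag_app import rag_answer  # placeholder for your pipeline's entry point

  GOLDEN_CASES = [
      # (question, required substring); values are illustrative
      ("What are the API limits for premium?", "unlimited"),
      ("What's your refund policy?", "30 days"),
  ]

  @pytest.mark.parametrize("question,must_contain", GOLDEN_CASES)
  def test_known_good_answers(question, must_contain):
      answer = rag_answer(question)
      assert must_contain.lower() in answer.lower()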

Building Your Detection System

Here's a practical approach to catching these failures:

1. Automated Scoring Pipeline

Set up automated evaluation for every response (see the combined sketch after this list):

  • Faithfulness score (context contradiction detection)
  • Relevancy score (answer-question alignment)
  • Context precision (retrieval quality)
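
A sketch of how these three scores can be attached to every response; the entries in scorers are evaluation callables you supply (the earlier sketches, adapted, or your preferred evaluation library):

  # Per-response scoring sketch: one record per answer, ready to log.
  from dataclasses import dataclass

  @dataclass
  class EvalRecord:
      faithfulness: float
      relevancy: float
      context_precision: float

  def score_response(question, answer, chunks, scorers) -> EvalRecord:
      return EvalRecord(
          faithfulness=scorers["faithfulness"](chunks, answer),
          relevancy=scorers["relevancy"](question, answer),
          context_precision=scorers["context_precision"](question, chunks),
      )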

2. Threshold-Based Alerting

Define quality thresholds (checked in the sketch after this list):

  • Faithfulness < 0.7: High risk
  • Relevancy < 0.5: Poor user experience
  • Context precision < 0.3: Retrieval failure
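
A minimal check against those limits; notify is a placeholder for whatever alerting channel you already use (a Slack webhook, a pager, structured logs):

  # Threshold check sketch using the limits above (scores assumed in [0, 1]).
  THRESHOLDS = {
      "faithfulness": 0.7,       # below this: high risk
      "relevancy": 0.5,          # below this: poor user experience
      "context_precision": 0.3,  # below this: retrieval failure
  }

  def check_thresholds(scores: dict[str, float], notify) -> list[str]:
      """Return the metrics that breached their threshold and send one alert."""
      breaches = [name for name, limit in THRESHOLDS.items()
                  if scores.get(name, 0.0) < limit]
      if breaches:
          notify(f"RAG quality alert: {', '.join(breaches)} below threshold")
      return breaches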

3. Continuous Monitoring

Track key metrics over time (see the aggregation sketch after this list):

  • Average scores per day/week
  • Distribution of score ranges
  • Regression test performance
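
A bare-bones aggregation sketch over logged scores; in production this would more likely live in your metrics store or dashboard than in application code:

  # Daily average per metric from logged (date, metric, value) rows.
  from collections import defaultdict
  from statistics import mean

  def daily_averages(logged_scores):
      """logged_scores: iterable of (date_str, metric_name, value) tuples."""
      buckets = defaultdict(list)
      for date, metric, value in logged_scores:
          buckets[(date, metric)].append(value)
      return {key: mean(values) for key, values in buckets.items()}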

4. Human Review Triggers

Automatically flag responses for human review (combined in the sketch after this list) when:

  • Multiple metrics are below threshold
  • High-stakes queries (legal, financial, medical)
  • User feedback indicates problems
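
A sketch that combines those conditions into one routing decision; the high-stakes keyword list is illustrative only:

  # Review-trigger sketch: route a response to a human when two or more
  # metrics breach their thresholds, the query looks high-stakes, or the
  # user flagged the answer. Keyword matching is a crude stand-in for a
  # proper query classifier.

  HIGH_STAKES_TERMS = ("legal", "contract", "refund", "medical", "invoice")

  def needs_human_review(scores: dict[str, float], thresholds: dict[str, float],
                         question: str, user_flagged: bool) -> bool:
      below = sum(1 for name, limit in thresholds.items()
                  if scores.get(name, 0.0) < limit)
      high_stakes = any(term in question.lower() for term in HIGH_STAKES_TERMS)
      return below >= 2 or high_stakes or user_flagged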

The ROI of Early Detection

Catching these failures early pays dividends:

  • Reduced support tickets: Fewer confused users contacting support
  • Higher user satisfaction: Consistent, accurate responses build trust
  • Compliance confidence: Meet regulatory requirements for AI systems
  • Development velocity: Faster iteration with automated quality gates

Prevention vs. Detection

While detection is crucial, prevention is even better:

  1. Quality data curation: Better source documents lead to better answers
  2. Retrieval optimization: Tune your search to find more relevant context
  3. Prompt engineering: Design prompts that encourage faithful, relevant responses
  4. Model selection: Choose models that balance capability with reliability

Getting Started

Start simple:

  1. Pick one metric (we recommend faithfulness)
  2. Evaluate a small sample of responses manually
  3. Build automated scoring for that metric
  4. Set up basic alerting
  5. Expand to other metrics gradually

Remember: perfect evaluation is less important than consistent evaluation. A simple system that runs reliably is better than a complex system that gets ignored.


Ready to implement systematic RAG evaluation? Join our private beta and get access to production-ready evaluation tools that integrate with your existing workflow.