RAG in the Real World: Why Scalable AI Needs More Than Just Retrieval and Prompts

Executive Summary

Retrieval-Augmented Generation (RAG) has quickly become one of the most glorified terms in enterprise AI. It is typically showcased over a handful of PDFs, where it looks easy and simple. Operating at enterprise scale—lakhs (hundreds of thousands) of records, low-latency retrieval, strict accuracy, and predictable costs—is much harder. The real limits show up in four places: embeddings, vector indexes (e.g., Azure AI Search), retrieval/filters, and the LLMs themselves. You’ll need hybrid architectures, careful schema/ops, observability, and strict cost controls to get beyond prototypes.

Embedding Challenges at Scale

Cost & volume. Embedding large datasets quickly runs into scale issues. Even a moderately sized corpus—each record carrying a few hundred to thousands of tokens—translates into tens of millions of tokens overall. The embedding phase alone can run into significant dollar costs per cycle, and this expense only grows as data is refreshed or re-processed.
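
For a sense of scale (illustrative numbers only): 200,000 records at roughly 250 tokens each is about 50 million tokens for a single embedding pass, and every re-chunk, schema change, or data refresh repeats that spend.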

Throughput & reliability. APIs enforce token and requests-per-minute (RPM) limits. Production pipelines need the following (a minimal sketch follows the list):

  • Token-aware dynamic batching
  • Retry with backoff + resume checkpoints
  • Audit logs for failed batches
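
A minimal sketch of what that can look like, using only the standard library; count_tokens is a stand-in for your model's tokenizer and embed_batch is a placeholder for your embedding API client, both assumptions rather than a specific SDK:

```python
import json, os, random, time

MAX_TOKENS_PER_BATCH = 8_000          # stay under the API's per-request token limit
CHECKPOINT_FILE = "embed_checkpoint.json"

def count_tokens(text: str) -> int:
    # Rough stand-in; replace with the tokenizer that matches your embedding model.
    return max(1, len(text) // 4)

def load_checkpoint() -> int:
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f).get("last_done", -1)
    return -1

def save_checkpoint(index: int) -> None:
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"last_done": index}, f)

def token_aware_batches(records, start):
    """Yield batches of (position, record) that stay under the token ceiling."""
    batch, tokens = [], 0
    for i, rec in enumerate(records):
        if i <= start:                     # resume: skip records already embedded
            continue
        t = count_tokens(rec["text"])
        if batch and tokens + t > MAX_TOKENS_PER_BATCH:
            yield batch
            batch, tokens = [], 0
        batch.append((i, rec))
        tokens += t
    if batch:
        yield batch

def run_pipeline(records, embed_batch, max_retries=5):
    """embed_batch(texts) is your embedding client call; assumed to return one vector per text."""
    start = load_checkpoint()
    for batch in token_aware_batches(records, start):
        texts = [rec["text"] for _, rec in batch]
        for attempt in range(max_retries):
            try:
                vectors = embed_batch(texts)
                break
            except Exception as err:                  # rate limits, transient 5xx, timeouts
                wait = (2 ** attempt) + random.random()
                print(f"batch failed ({err}); retrying in {wait:.1f}s")
                time.sleep(wait)
        else:
            # Audit log for batches that never succeeded, so they can be replayed later.
            with open("failed_batches.log", "a") as f:
                f.write(json.dumps([i for i, _ in batch]) + "\n")
            continue
        # ... persist `vectors` to your vector index / store here ...
        save_checkpoint(batch[-1][0])
```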

Chunking trade-offs.

  • Too fine → semantic context is lost
  • Too coarse → token bloat + noisy matches

Use header/paragraph-aware chunking with small overlaps; expect to tune per source type.
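
One way to implement that, sketched here under the assumption that section headers start with '#' and paragraphs are separated by blank lines; the thresholds are illustrative and will need per-source tuning:

```python
def chunk_document(text, max_chars=1500, overlap_paragraphs=1):
    """Header/paragraph-aware chunking with a small paragraph overlap between chunks."""
    chunks, current, header = [], [], ""

    def flush():
        body = "\n\n".join(current)
        chunks.append(f"{header}\n{body}" if header else body)

    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    for para in paragraphs:
        if para.startswith("#"):          # new section: close the running chunk, remember the header
            if current:
                flush()
            current, header = [], para
            continue
        if current and sum(len(p) for p in current) + len(para) > max_chars:
            flush()
            current = current[-overlap_paragraphs:]   # small overlap carries context forward
        current.append(para)
    if current:
        flush()
    return chunks
```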

Domain relevance gaps. General-purpose embeddings miss subtle, domain-specific meaning (biomedical, legal, financial). Dimensionality isn’t the cure; domain representation is. Without specialization, recall will feel “lexical” rather than truly semantic.

Vector Databases: Strengths… and Constraints for RAG

Immutable schema. Existing field definitions can’t be changed once an index is created. Making a field filterable, or adding a new filterable/tag field that has to be populated for every document, effectively means recreating and reloading the index.

Full reloads. Schema tweaks or chunking updates often require re-embedding and re-indexing everything—expensive and time-consuming (parallelism will still hit API quotas).

Operational sprawl. Multiple teams/use cases → multiple indexes → fragmented pipelines and higher latency. Unlike a DB with views/joins, AI Search pushes you toward rigid, static definitions.
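
One operational pattern that softens the pain is to treat index names as versioned and immutable: build v(N+1) alongside vN, then cut consumers over. A minimal sketch, where create_index, embed_and_upload, and repoint_alias are hypothetical wrappers around your search service rather than real SDK calls:

```python
def rebuild_index(base_name, version, records, create_index, embed_and_upload, repoint_alias):
    """Blue/green style re-index: build v(N+1) next to vN, then cut consumers over.

    create_index, embed_and_upload and repoint_alias are hypothetical wrappers around
    whatever search/vector service you use; the pattern, not the SDK, is the point.
    """
    new_name = f"{base_name}-v{version + 1}"
    create_index(new_name)                 # fresh schema, including the newly added filterable field
    embed_and_upload(new_name, records)    # the unavoidable full reload: re-embed + re-index
    repoint_alias(base_name, new_name)     # queries keep using a stable alias/config entry
    return new_name
```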

(Reference: Microsoft docs on Azure AI Search vector search and integrated vectorization.)

Retrieval & Filtering Limits

Shallow top-K. Relevant items can fall just outside the top-K cut-off. In regulated domains, a single miss matters.

Context window pressure. As a best practice, send only the required fields to the model, then join its predicted answers back to the source data externally using key identifiers such as IDs. (In our scenario we sent only 1–2 of a table’s 25 columns, the ID and a description, to the model.)
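
A minimal sketch of that pattern, with call_llm and the record layout as assumptions: only the ID and description go to the model, which returns matching IDs, and the remaining columns are joined back outside the LLM.

```python
import json

def answer_with_external_join(question, retrieved_rows, full_table, call_llm):
    """Send only id + description to the LLM, then rehydrate full rows afterwards."""
    # Keep the prompt small: 2 of 25 columns go to the model.
    slim = [{"id": r["id"], "description": r["description"]} for r in retrieved_rows]
    prompt = (
        f"Question: {question}\n"
        f"Candidate records: {json.dumps(slim)}\n"
        'Return a JSON list of the matching ids, e.g. ["id1", "id2"].'
    )
    predicted_ids = json.loads(call_llm(prompt))
    # External join: pull the remaining 23 columns from the source table, not the LLM.
    by_id = {row["id"]: row for row in full_table}
    return [by_id[i] for i in predicted_ids if i in by_id]
```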

Filter logic ceiling. Basic metadata filters work, but you’ll miss nested conditions, dynamic role-based filters, and cross-field joins that are trivial in SQL.
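
One workaround we find practical is to over-fetch from the vector index and apply the richer conditions in the application or SQL layer. A hedged sketch, with the user and record fields assumed for illustration:

```python
def hybrid_filtered_search(query_vector, vector_search, user, top_k=10, overfetch=5):
    """Over-fetch from the vector index, then apply logic its filter syntax can't express."""
    candidates = vector_search(query_vector, k=top_k * overfetch)   # simple metadata filters only
    allowed_regions = set(user["regions"])                          # dynamic, role-based condition
    results = [
        c for c in candidates
        if c["region"] in allowed_regions
        and (c["status"] == "active" or c["owner"] == user["id"])   # nested condition
    ]
    return results[:top_k]
```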

LLM Limitations You Will Hit

Input limits. Even with large-context models (450k+ tokens), we observed that certain records, particularly those in lower relevance ranges, may be implicitly skipped during reasoning even when they fit within the window. This behaviour is non-deterministic and poses serious challenges for enterprise-grade tasks where every data point is critical.

Output limits. Many models cap useful output (e.g., 2k–16k tokens). This affects multi-record responses, structured summaries, and complex decision-making outputs, leading to truncated or incomplete results, particularly when returning JSON, tables, or lists.
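
One mitigation is to split the records into batches small enough that each structured response stays comfortably under the output cap, then merge the pieces. A minimal sketch, with call_llm and the per-record token estimate as assumptions:

```python
import json

def summarise_in_batches(records, call_llm, output_cap_tokens=2000, tokens_per_record=80):
    """Keep each response comfortably under the model's output cap, then merge the pieces."""
    batch_size = max(1, (output_cap_tokens // 2) // tokens_per_record)  # leave ~50% headroom
    merged = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        prompt = (
            "Return a JSON array with one summary object per record.\n"
            f"Records: {json.dumps(batch)}"
        )
        merged.extend(json.loads(call_llm(prompt)))
    return merged
```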

Latency & variance. Complex prompts over hundreds of records can take minutes. Stochastic ranking creates run-to-run differences—tough for enterprise SLAs.

Concurrency & quotas. Enterprises face quota exhaustion from shared token pools; concurrent usage by multiple users or batch agents can quickly consume the limits. Smaller organizations working with 8K-input / 2K-output models (e.g., LLaMA, Mistral) face even tighter ceilings, making RAG challenging beyond pilot projects.
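
A minimal sketch of guarding a shared quota: a semaphore caps concurrent calls and a coarse token budget is reserved before each request. The numbers and the call_llm helper are illustrative assumptions, not provider defaults.

```python
import threading

class QuotaGuard:
    """Caps concurrent LLM calls and tracks a shared, roughly per-minute token budget."""

    def __init__(self, max_concurrent=4, tokens_per_minute=200_000):
        self._sem = threading.Semaphore(max_concurrent)
        self._lock = threading.Lock()
        self._tokens_per_minute = tokens_per_minute
        self._budget = tokens_per_minute

    def refill(self):
        # Call once a minute (e.g., from a timer) to reset the shared budget.
        with self._lock:
            self._budget = self._tokens_per_minute

    def call(self, call_llm, prompt, est_tokens):
        with self._lock:
            if est_tokens > self._budget:
                raise RuntimeError("shared token budget exhausted; queue or defer the request")
            self._budget -= est_tokens
        with self._sem:            # at most max_concurrent requests in flight at once
            return call_llm(prompt)
```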

Productionisation, Observability — and Evaluation That Actually Matters

While most RAG demos end at a “right-looking” answer, real deployments must be observable, traceable, resilient, and continuously evaluated. GenAI pipelines often underinvest in these layers, especially with multiple async retrievals and LLM hops.

Operational pain points

  • Sparse/unstructured LLM logs → hard to reproduce issues or inspect reasoning paths.
  • Thin vector/AI Search telemetry → silent filter failures or low-recall cases go unnoticed.
  • Latency tracing across hybrids (SQL + vector + LLM) is messy without end-to-end spans.
  • Failure isolation is non-trivial: embed vs ranker vs LLM vs truncation?

What good observability looks like

  • End-to-end tracing (embed → index → retrieve → rerank → prompt → output).
  • Structured logs for: retrieval sets & scores, prompt/response token usage, cost, latency, confidence, and final joins (see the sketch after this list).
  • Quality gates and alerts on recall@K, latency budgets, cost per query, and hallucination/citation signals.
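
To make that concrete, here is a minimal sketch of span-style structured logging across the pipeline stages, with a single trace_id tying them together; the field names and the "contracts-v3" index are illustrative assumptions, and print stands in for a real telemetry sink.

```python
import contextlib, json, time, uuid

@contextlib.contextmanager
def traced(stage, trace_id, **fields):
    """Emit one structured log record per pipeline stage (embed/retrieve/rerank/prompt/output)."""
    start = time.time()
    error = None
    try:
        yield
    except Exception as exc:
        error = repr(exc)
        raise
    finally:
        record = {
            "trace_id": trace_id,
            "stage": stage,
            "latency_ms": round((time.time() - start) * 1000, 1),
            "error": error,
            **fields,               # retrieval scores, token usage, cost, confidence, joins...
        }
        print(json.dumps(record))   # swap print for your logging/telemetry sink

# Usage: a single trace_id ties every stage of one query together end to end.
trace_id = str(uuid.uuid4())
with traced("retrieve", trace_id, top_k=10, index="contracts-v3"):
    pass  # ... run the vector query here ...
```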

Evaluation: beyond generic metrics

  • Partner with domain experts to define what “good” means. Automatic scores alone aren’t enough in regulated or domain-heavy settings.
  • Build a domain ground-truth set (gold + “acceptable variants”) curated by SMEs; refresh it quarterly.
  • Establish a human-in-the-loop workflow: double-blind SME review on sampled traffic; escalate low-confidence or low-evidence answers by policy.
  • Maintain an error taxonomy (missed retrieval, wrong join, truncation, unsupported query, hallucination) with severity labels; track trends over time.
  • Run canaries/A-B tests in prod; compare quality, latency, cost, and SME acceptance before full rollout.
  • Log evaluation metadata (query id, versioned index, chunking config, model/runtime version) so you can pinpoint regressions.

In practice (our pattern)

  • We co-defined a gold set with SMEs and require evidence-backed answers; low-evidence responses are auto-routed to review.
  • We track recall@K + citation coverage for retrieval, and field-level precision/recall for extraction tasks, alongside cost/latency dashboards.
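
For reference, the retrieval-side numbers are simple to compute once the gold set exists. A minimal sketch, assuming gold document IDs and cited IDs are tracked per query:

```python
def recall_at_k(retrieved_ids, gold_ids, k=10):
    """Fraction of gold documents that appear in the top-k retrieved set."""
    if not gold_ids:
        return 1.0
    hits = len(set(retrieved_ids[:k]) & set(gold_ids))
    return hits / len(gold_ids)

def citation_coverage(answer_citation_ids, gold_ids):
    """Fraction of gold evidence that the final answer actually cites."""
    if not gold_ids:
        return 1.0
    return len(set(answer_citation_ids) & set(gold_ids)) / len(gold_ids)

# Averaged over the SME-curated gold set, these feed the quality gates and dashboards above.
```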

Bottom line. Without observability + domain-grounded evaluation, production RAG stays brittle and opaque. With them, you get a system you can debug, trust, and scale—not just a demo that looks good once.

Conclusion: Beyond the Buzzwords

While Retrieval-Augmented Generation (RAG) has gained mainstream attention as the future of enterprise AI, real-world adoption reveals a wide gap between expectation and execution. From embedding inconsistencies and rigid vector schema constraints to LLM context bottlenecks and high operational latency, the challenges compound rapidly at production scale. Even in enterprise environments with access to high-token models, limitations around completeness, determinism, and runtime stability remain unresolved.

This doesn’t mean RAG is fundamentally flawed—it’s a powerful paradigm when paired with the right retrieval tuning, agent orchestration, hybrid pipelines, and system-level observability. But as engineers and architects, it’s time we shift the conversation from aspirational posts to grounded, production-aware designs. Until embedding models evolve to be truly domain-specific, vector systems allow dynamic schemas, and LLMs deliver predictable performance at scale, RAG should be treated not as a plug-and-play solution—but as a custom-engineered pipeline with domain, data, and budget constraints at its core.

This blog represents the collective strength of our AI team—where collaboration, innovation, and expertise come together to create meaningful insights. It is a testament to the value SNP delivers through its AI practice, showcasing how we help our customers turn possibilities into impact.
