
RAG at Scale: Indexes, Vector Search, and Streaming Updates

Priya Desai


April 5, 2026


The retrieval bottleneck in agent systems

In production, RAG pipelines fail less because of model quality and more because of retrieval quality under load. As document volume and tenant count grow, latency spikes and context relevance drops. The fix starts with index strategy, not prompt tweaks.

Data structures behind fast retrieval

  • Inverted indexes: strong lexical recall for exact entities and compliance keywords.
  • ANN graphs (HNSW): low-latency semantic nearest-neighbor search.
  • Hash-based sharding: route tenant data predictably for balanced partitioning.
  • B-Trees / LSM indexes: keep metadata filters fast for time, region, and source constraints.
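The sharding rule in the third bullet can be sketched with a stable hash. This is a minimal illustration; the shard count and key scheme are assumptions, not a prescription:

```python
import hashlib

NUM_SHARDS = 8  # illustrative; size to your cluster

def shard_for_tenant(tenant_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Deterministically map a tenant to a shard via a stable hash,
    so the same tenant's data always lands on the same partition."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Because the mapping is a pure function of the tenant ID, any node can compute the route locally without a lookup service.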

Hybrid retrieval architecture

Stage 1: candidate generation

Run BM25 and vector search in parallel. Merge top-k candidates with reciprocal rank fusion to maximize recall.
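Reciprocal rank fusion is simple enough to show inline: each list contributes 1 / (k + rank) per document, and the sums are sorted. The constant k = 60 is the commonly used default:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists: score(d) = sum over lists of 1 / (k + rank).
    Documents that rank well in multiple lists rise to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["d1", "d2", "d3"]   # lexical top-k
vector_hits = ["d3", "d1", "d4"]   # semantic top-k
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Note that RRF needs only ranks, not scores, so it sidesteps the problem of calibrating BM25 scores against cosine similarities.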

Stage 2: re-ranking

Use a cross-encoder or lightweight reranker model to prioritize evidence quality and freshness.
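One way to fold freshness into re-ranking is to blend the model's relevance score with an exponential recency decay. The sketch below assumes the cross-encoder score is already computed per candidate; the half-life and blend weight are illustrative assumptions you would tune on an eval set:

```python
import math
import time

def rerank(candidates, now=None, half_life_days=30.0, freshness_weight=0.3):
    """Sort candidates by a blend of relevance (e.g. a cross-encoder score
    in [0, 1]) and exponential recency decay with the given half-life."""
    now = time.time() if now is None else now

    def blended(c):
        age_days = max(0.0, (now - c["updated_at"]) / 86400.0)
        freshness = math.exp(-math.log(2.0) * age_days / half_life_days)
        return (1.0 - freshness_weight) * c["relevance"] + freshness_weight * freshness

    return sorted(candidates, key=blended, reverse=True)
```

With equal relevance, a document updated today outranks one updated two half-lives ago, without letting freshness override a large relevance gap.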

Stage 3: context assembly

Apply diversity constraints to avoid duplicated snippets, then fill the token budget with weighted chunk selection.
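A greedy version of this stage can be sketched in a few lines: take chunks in score order, skip any that blow the token budget or are near-duplicates of already-selected text. The Jaccard similarity and the 0.85 threshold are stand-ins for whatever deduplication signal you actually use:

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap; a cheap stand-in for a real similarity measure."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if (wa | wb) else 0.0

def assemble_context(chunks, token_budget, sim=jaccard, max_sim=0.85):
    """Greedy weighted chunk selection under a token budget, skipping
    chunks too similar to ones already chosen."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if used + chunk["tokens"] > token_budget:
            continue  # would exceed the budget; try cheaper chunks
        if any(sim(chunk["text"], s["text"]) > max_sim for s in selected):
            continue  # near-duplicate of something already in context
        selected.append(chunk)
        used += chunk["tokens"]
    return selected
```

Greedy selection is not optimal, but it is fast and predictable, which matters when assembly sits on the request path.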

Streaming updates with Flink

Batch re-indexing is too slow for agent workflows that depend on fresh documents. With Apache Flink, you can consume change events, re-chunk documents, recompute embeddings, and push index updates continuously.

  1. Source CDC events from operational databases.
  2. Normalize and deduplicate in Flink keyed streams.
  3. Generate embeddings and metadata features.
  4. Upsert into vector and lexical indexes with version tags.
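The correctness-critical piece of step 4 is the version tag: CDC events can arrive out of order, and a stale update must never overwrite a fresh one. In production this guard lives inside a Flink sink; the in-memory index below is just a stand-in for the vector and lexical stores, and the field names are assumptions:

```python
class VersionedIndex:
    """Toy index illustrating version-guarded upserts (step 4)."""

    def __init__(self):
        self._docs = {}

    def upsert(self, doc_id, payload, version):
        """Apply the update only if it is newer than what is stored,
        so out-of-order CDC events cannot clobber fresher data.
        Returns True if the write was applied."""
        current = self._docs.get(doc_id)
        if current is None or version > current["version"]:
            self._docs[doc_id] = {"payload": payload, "version": version}
            return True
        return False

    def get(self, doc_id):
        entry = self._docs.get(doc_id)
        return entry["payload"] if entry else None
```

The same compare-and-set discipline applies whether the version is a CDC log offset, a source timestamp, or a monotonic document revision.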

Practical scaling rules

  • Separate cold and hot indexes by recency.
  • Cache reranker features in Redis for repeated queries.
  • Use query-time filters before ANN to reduce candidate explosion.
  • Store chunk lineage to explain every generated answer.

Metrics that actually matter

  • P95 retrieval latency per tenant and route.
  • Answer-grounding rate (percent of answers with verifiable citations).
  • Index freshness lag (source update to searchable update).
  • Top-k recall at evidence level from labeled eval sets.
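The last metric is cheap to compute once you have labeled evidence sets. A minimal implementation, assuming retrieved and relevant items share an ID space:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of labeled relevant evidence items that appear in the
    top-k retrieved results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)
```

Tracking this per tenant and per route, alongside P95 latency, shows whether a speed optimization quietly traded away recall.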

"Great RAG systems are won in indexing and retrieval pipelines long before prompt engineering starts."

- Priya Desai