Building a Production-Ready RAG System with Incremental Indexing
By Nino, Senior Tech Editor
Retrieval-Augmented Generation (RAG) has rapidly evolved from an experimental concept to the standard architecture for enterprise AI applications. By connecting Large Language Models (LLMs) to custom knowledge bases, developers can provide models like Claude 3.5 Sonnet or DeepSeek-V3 with the specific, up-to-date context they need to reduce hallucinations and provide accurate answers. However, as these knowledge bases grow from a few dozen files to thousands of documents, a major bottleneck emerges: document management.
In a standard RAG setup, adding or updating a single document often requires re-indexing the entire collection. This is not only slow but also prohibitively expensive when using high-performance APIs. To build a truly production-ready system, you need a way to manage updates efficiently. By utilizing n1n.ai, developers can access a wide range of LLMs to power the generation phase, but the underlying data pipeline must be optimized first. This tutorial explores how to implement incremental indexing to solve these scaling challenges.
The Problem with Traditional RAG Pipelines
Most RAG tutorials follow a simple pattern: load documents, split them into chunks, create embeddings, and save them to a vector store. While this works for a small demo, it fails in production for several reasons:
- Redundant Processing: If you have 1,000 documents and you change one word in one file, a naive pipeline re-processes all 1,000 files. This wastes CPU cycles and API credits for embeddings.
- Deletion Blindness: If a file is deleted from your source folder, traditional pipelines often leave the old chunks in the vector store, leading to outdated or incorrect answers.
- Latency: Re-indexing a large corpus can take minutes or hours, making it impossible to keep the AI's knowledge in sync with real-time data changes.
The Solution: Incremental Indexing with SQLRecordManager
The solution is to implement an "indexing ledger." This ledger tracks the state of every document currently in your vector store. When you run a sync, the system compares the current files on disk with the ledger and only performs actions on the differences.
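Before turning to LangChain's implementation, the ledger idea itself is easy to sketch. The snippet below is an illustrative file-level diff, not the LangChain API: `diff_against_ledger` and the `ledger` dict are hypothetical names used only to show how comparing content hashes against a recorded state yields the three deltas.

```python
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    """Content hash used as the change-detection signal."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def diff_against_ledger(source_dir: str, ledger: dict) -> tuple:
    """ledger maps file path -> hash recorded at the last sync."""
    current = {
        str(p): file_hash(p)
        for p in Path(source_dir).rglob("*") if p.is_file()
    }
    added = sorted(p for p in current if p not in ledger)
    updated = sorted(p for p in current if p in ledger and ledger[p] != current[p])
    deleted = sorted(p for p in ledger if p not in current)
    return added, updated, deleted
```

Only the `added` and `updated` paths need re-embedding; the `deleted` paths tell you which stale chunks to purge.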
We will use LangChain's SQLRecordManager and the index() function. This setup allows for three critical operations:
- Add: Only process new files.
- Update: Only re-process files whose content has changed (detected via hashing).
- Delete: Remove chunks from the vector store if the source file no longer exists.
Implementation Guide
To get started, you will need a vector database (like Chroma or Pinecone) and an embedding model. For the generation phase, using a stable API aggregator like n1n.ai ensures you can switch between models like GPT-4o or OpenAI o3 without changing your retrieval logic.
1. Configuration and Setup
First, define your environment variables and constants. We use nomic-embed-text for local embeddings and SQLite for our record manager.
```python
# Configuration
CHROMA_PATH = "chroma_db"
RECORD_DB_PATH = "sqlite:///record_manager_cache.sql"
SOURCE_FOLDER = "./Knowledge"
EMBEDDING_MODEL = "nomic-embed-text"
COLLECTION_NAME = "production_rag"
CHUNK_SIZE = 600
CHUNK_OVERLAP = 100
```
2. The Vector Store and Record Manager
The SQLRecordManager acts as the brain of our synchronization logic. It stores hashes of your document chunks to determine if they need updating.
```python
from langchain.indexes import SQLRecordManager, index
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

def get_vector_store():
    embeddings = OllamaEmbeddings(model=EMBEDDING_MODEL)
    return Chroma(
        collection_name=COLLECTION_NAME,
        persist_directory=CHROMA_PATH,
        embedding_function=embeddings
    )

def sync_knowledge_base():
    vectorstore = get_vector_store()

    # The namespace prevents collisions between different collections
    record_manager = SQLRecordManager(
        namespace=f"chroma/{COLLECTION_NAME}",
        db_url=RECORD_DB_PATH
    )
    record_manager.create_schema()

    # Load and split documents
    loader = DirectoryLoader(SOURCE_FOLDER, glob="**/*.*", loader_cls=TextLoader)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP
    )
    docs = loader.load_and_split(text_splitter)

    # The indexing magic: only deltas are written to the vector store
    indexing_stats = index(
        docs,
        record_manager,
        vectorstore,
        cleanup="full",  # ensures chunks from deleted files are removed
        source_id_key="source"
    )
    return indexing_stats
```
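The `index()` function returns a dictionary of counters (`num_added`, `num_updated`, `num_skipped`, `num_deleted`) that you can log after each sync. The numbers below are purely illustrative, and `summarize` is a hypothetical helper for formatting them:

```python
# Example shape of the stats dict returned by index();
# the values here are made up for illustration.
stats = {"num_added": 3, "num_updated": 1, "num_skipped": 120, "num_deleted": 2}

def summarize(stats: dict) -> str:
    """Format indexing stats for a sync log line."""
    return ("indexed: added={num_added}, updated={num_updated}, "
            "skipped={num_skipped}, deleted={num_deleted}").format(**stats)

print(summarize(stats))
```

Tracking `num_skipped` over time is a quick sanity check: on an unchanged corpus it should equal your total chunk count.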
How the Indexing Logic Works
When the index() function is called, it performs a series of checks:
- Hashing: It calculates a hash for every chunk of text in your documents.
- Comparison: It checks the SQLite database to see if that hash already exists for that specific `source_id`.
- Delta Application:
  - If the hash is new, the chunk is added to the vector store.
  - If a `source_id` in the database is missing from your current `docs` list, its chunks are deleted from the vector store (because `cleanup="full"` is set).
  - If the hash matches, nothing is done.
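The delta logic above can be sketched at the chunk level. This is a simplified model, not LangChain's internals: `plan_actions`, `current`, and `ledger` are hypothetical names, with each dict mapping a `source_id` to the set of chunk hashes belonging to it.

```python
def plan_actions(current: dict, ledger: dict):
    """current / ledger map source_id -> set of chunk hashes.
    Returns (chunks to add, chunks to delete)."""
    to_add, to_delete = [], []
    for source, hashes in current.items():
        seen = ledger.get(source, set())
        to_add += sorted((source, h) for h in hashes - seen)      # new or changed
        to_delete += sorted((source, h) for h in seen - hashes)   # stale versions
    # cleanup="full": sources absent from the current load are purged entirely
    for source in sorted(ledger.keys() - current.keys()):
        to_delete += sorted((source, h) for h in ledger[source])
    return to_add, to_delete
```

Chunks whose hash is unchanged appear in neither list, which is exactly why a no-op sync is nearly free.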
Performance Comparison
In a production environment with 5,000 documents, the difference is staggering.
| Operation | Traditional Re-Indexing | Incremental Indexing | Time Saved |
|---|---|---|---|
| Adding 1 new file | 12 minutes | 4 seconds | > 99% |
| Updating 1 file | 12 minutes | 7 seconds | > 99% |
| No changes | 12 minutes | 2 seconds | 100% |
By reducing the workload, you free up resources to focus on the quality of the generation. This is where n1n.ai comes in. Once your retrieval is fast and accurate, you can pipe that context into the latest models available on n1n.ai to get the best possible answers for your users.
Pro Tips for Production RAG
- Source Identification: Always ensure your metadata contains a unique `source` key (like a file path or URL). This is what the record manager uses to track document identity.
- Chunk Size Tuning: For technical documentation, a `CHUNK_SIZE` of 800-1000 with 15% overlap usually yields better results than smaller chunks, as it preserves more structural context.
- Cleanup Strategy: Use `cleanup="full"` with caution. If your document loader fails to read a directory, it might assume all files are deleted and wipe your vector store. Always include error handling in your loading logic.
- Embedding Latency: If you are processing millions of chunks, consider using a hosted embedding service via n1n.ai to parallelize the workload beyond what a local machine can handle.
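The cleanup caveat above is worth guarding against in code. One possible safeguard, with `guard_doc_count` being a hypothetical helper, is to refuse a `cleanup="full"` sync whenever the loader returns suspiciously few documents:

```python
def guard_doc_count(docs: list, min_expected: int = 1) -> list:
    """Abort before a cleanup='full' sync if the load looks like a failure,
    so a broken loader cannot wipe the vector store."""
    if len(docs) < min_expected:
        raise RuntimeError(
            f"Loader returned {len(docs)} documents (expected >= {min_expected}); "
            "aborting sync to avoid deleting everything."
        )
    return docs
```

You would then call `index(guard_doc_count(docs), record_manager, vectorstore, cleanup="full", source_id_key="source")`, setting `min_expected` to a conservative floor for your corpus size.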
Conclusion
Moving from a basic RAG script to a production-ready system requires shifting from "state-less" processing to "state-aware" synchronization. By implementing incremental indexing, you ensure your knowledge base is always current, your costs are minimized, and your system can scale to meet enterprise demands.
Once your data pipeline is optimized, the quality of your RAG system depends on the intelligence of the LLM. Access the world's most powerful models with low latency and high reliability through a single interface.
Get a free API key at n1n.ai