Local RAG Implementation with .NET 9, Semantic Kernel, and Ollama

By Nino, Senior Tech Editor

The landscape of Generative AI has shifted dramatically. While the initial wave was dominated by centralized APIs, the current trend is moving toward local execution and hybrid architectures. For developers, tools like ChatGPT and GitHub Copilot are indispensable, yet they come with significant caveats: dependency on external providers, per-token pricing, and the inherent risk of sending sensitive corporate data to the cloud. By leveraging n1n.ai, developers can bridge the gap between local testing and high-speed production APIs, but for the ultimate privacy-first approach, running models like Llama 3.2 locally is the gold standard.

This tutorial explores how to build a fully local Retrieval Augmented Generation (RAG) system using .NET 9, Semantic Kernel, and Ollama. This setup allows you to search through your own documents and generate answers without a single packet of data leaving your machine.

Understanding the RAG Architecture

Retrieval Augmented Generation (RAG) is a design pattern that addresses the fundamental limitation of Large Language Models (LLMs): they only know what they were trained on. If you ask a model about a document you wrote yesterday, it will hallucinate or admit ignorance. RAG solves this by providing the model with a "cheat sheet" of relevant context at the time of the query.

  1. Retrieval: When a user asks a question, the system searches a local database (or memory) for the most relevant text fragments.
  2. Augmentation: These fragments are injected into the prompt, giving the model temporary "domain knowledge."
  3. Generation: The LLM synthesizes an answer based on the provided context.

This approach is significantly more efficient than fine-tuning, as it allows for real-time data updates without retraining the model. For enterprises scaling these solutions, n1n.ai offers the necessary infrastructure to manage high-throughput LLM requests when local hardware reaches its limits.
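
In code, each user query maps onto these same three calls. The fragment below is only an outline of what this tutorial builds in the following sections (the embedding service, the document store with its GetRelevant helper, and the chat service are all defined later):

// 1. Retrieval: embed the question and search the local store for similar fragments
var queryVector = (await embeddingService.GenerateEmbeddingsAsync([question]))[0].ToArray();
var hits = store.GetRelevant(queryVector, top: 2);

// 2. Augmentation: inject the retrieved fragments into the prompt as temporary domain knowledge
history.AddSystemMessage($"Use this context: {string.Join("\n---\n", hits.Select(h => h.Content))}");

// 3. Generation: let the local model synthesize an answer from the augmented prompt
history.AddUserMessage(question);
var answer = await chatService.GetChatMessageContentAsync(history);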

Prerequisites and Setup

To follow this implementation, ensure your environment meets the following requirements:

  • .NET 9 SDK: The latest features in Semantic Kernel often target the newest framework versions.
  • Ollama v0.9.5+: The local engine that hosts and serves models like Llama 3.2.
  • Hardware: At least 4GB of VRAM (for GPU acceleration) or 8GB of system RAM for CPU-only execution.

First, start the Ollama server and pull the model:

ollama serve          # starts the local Ollama API (default: http://localhost:11434)
ollama run llama3.2   # downloads Llama 3.2 on first run and opens an interactive prompt
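
The pipeline below also needs an embedding model. Reusing llama3.2 for embeddings works, but the dedicated embedding models mentioned in the tips at the end of this article are smaller and faster; pulling one now is optional:

ollama pull nomic-embed-text   # optional: a compact model dedicated to embeddings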

Next, create a new .NET console application and install the necessary NuGet packages. Note that the Ollama connector for Semantic Kernel is still in preview, so its APIs are gated behind the SKEXP0070 experimental diagnostic, which you must suppress or propagate (as the ingestion code later in this article does with the [Experimental] attribute).

dotnet new console -n LocalRAGDemo
cd LocalRAGDemo
dotnet add package Microsoft.SemanticKernel
dotnet add package Microsoft.SemanticKernel.Connectors.Ollama --version 1.34.0-preview
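
Because the connector is experimental, every call into it raises the SKEXP0070 compiler diagnostic. You can propagate it with the [Experimental] attribute as shown later, or acknowledge it locally with a pragma (a project-wide <NoWarn>SKEXP0070</NoWarn> property in the .csproj also works):

#pragma warning disable SKEXP0070 // Ollama connector APIs are experimental and may change
// ... code that uses the Ollama connector ...
#pragma warning restore SKEXP0070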

The Core Components

Our RAG application relies on three pillars:

  1. Semantic Kernel: The orchestration engine that connects the LLM, the memory, and the user input.
  2. Ollama: Our local inference server.
  3. Vector Store: For this demo, we use an in-memory store with Cosine Similarity, though production systems should consider Qdrant or Milvus.
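
The later snippets assume an ollamaClient and an embeddingService already exist. A minimal wiring sketch, assuming the OllamaApiClient type from the OllamaSharp dependency and the AddOllamaTextEmbeddingGeneration builder extension from the connector (exact names and overloads may shift between preview releases), could look like this:

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Embeddings;
using OllamaSharp;

#pragma warning disable SKEXP0070, SKEXP0001 // connector and embedding abstractions are experimental

var endpoint = new Uri("http://localhost:11434");

// Raw Ollama client, later turned into a chat service via AsChatCompletionService()
var ollamaClient = new OllamaApiClient(endpoint, "llama3.2");

// Embedding service registered through the kernel builder.
// Swap "llama3.2" for "nomic-embed-text" if you pulled the dedicated embedding model.
var kernel = Kernel.CreateBuilder()
    .AddOllamaTextEmbeddingGeneration("llama3.2", endpoint)
    .Build();
var embeddingService = kernel.GetRequiredService<ITextEmbeddingGenerationService>();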

Implementation: Document Ingestion and Vectorization

Before we can ask questions, we must turn our text documents into vectors (embeddings). An embedding is a numerical representation of meaning. If two sentences are semantically similar, their vectors will be close together in multi-dimensional space.

[Experimental("SKEXP0070")]
public static async Task<InMemoryDocumentStore> ImportDocumentsAsync(
    string path,
    ITextEmbeddingGenerationService embeddingService)
{
    var store = new InMemoryDocumentStore();
    var files = Directory.GetFiles(path, "*.md");

    foreach (var file in files)
    {
        var text = await File.ReadAllTextAsync(file);
        // Generate a vector for the document content
        var embedding = (await embeddingService.GenerateEmbeddingsAsync([text]))[0].ToArray();
        store.Add(text, embedding);
    }
    return store;
}

The Search Engine: Cosine Similarity

To find the "best" document, we calculate the Cosine Similarity between the user's question embedding and our stored document embeddings. This involves a dot product of the vectors divided by the product of their magnitudes.

private static double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    // The small epsilon guards against division by zero for zero-length vectors
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB) + 1e-5);
}
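
The InMemoryDocumentStore type itself is not shown above, since any container with an Add and a GetRelevant method will do. A minimal sketch that matches how the store is used in this article (and treats the CosineSimilarity helper above as a member of the same class) might be:

public sealed record ScoredDocument(string Content, double Score);

public sealed class InMemoryDocumentStore
{
    private readonly List<(string Content, float[] Vector)> _items = [];

    public void Add(string content, float[] vector) => _items.Add((content, vector));

    // Scores every stored document against the query vector and returns the best matches,
    // calling the CosineSimilarity helper shown above (assumed to live in this class).
    public IReadOnlyList<ScoredDocument> GetRelevant(float[] queryVector, int top) =>
        _items.Select(i => new ScoredDocument(i.Content, CosineSimilarity(queryVector, i.Vector)))
              .OrderByDescending(d => d.Score)
              .Take(top)
              .ToList();
}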

Orchestrating the Chat Loop

The heart of the application is the chat loop. It captures the user input, retrieves the context, and prompts the LLM.

We must be careful with prompt length. By setting a similarity threshold (e.g., 0.10), we only inject context if it is actually relevant. This prevents the model from getting confused by unrelated noise.

var chatService = ollamaClient.AsChatCompletionService();
var chatHistory = new ChatHistory("You are a helpful assistant specialized in technical documentation.");

do {
    string? userQuery = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(userQuery)) break; // empty input ends the session

    // Embed the question with the same model used for the documents
    var queryVector = (await embeddingService.GenerateEmbeddingsAsync([userQuery]))[0].ToArray();

    // Find the top 2 relevant documents
    var results = store.GetRelevant(queryVector, top: 2);

    // Work on a copy so the injected context never accumulates in the base history
    var promptHistory = new ChatHistory(chatHistory);
    if (results.Any(r => r.Score > 0.10))
    {
        var context = string.Join("\n---\n", results.Select(r => r.Content));
        promptHistory.AddSystemMessage($"Use this context: {context}");
    }

    promptHistory.AddUserMessage(userQuery);
    var response = await chatService.GetChatMessageContentAsync(promptHistory);
    Console.WriteLine($"AI: {response.Content}");
} while (true);
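
Note that promptHistory is rebuilt from the base chatHistory on every iteration, so injected context never piles up, but the model also forgets previous turns. If you want multi-turn memory, append each exchange to the base history right after printing the response:

// Optional: keep conversational memory across turns
chatHistory.AddUserMessage(userQuery);
chatHistory.AddAssistantMessage(response.Content ?? string.Empty);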

Pro Tips for Scaling Local RAG

While this in-memory approach is perfect for a laptop demo, enterprise-grade RAG requires more robustness. If you find your local hardware struggling, consider offloading specific tasks to n1n.ai. For example, you could use local embeddings for privacy while using a high-performance model like Claude 3.5 Sonnet via n1n.ai for the final generation step.

  1. Use Specialized Embedding Models: Llama 3.2 is an excellent generalist, but models like bge-small or nomic-embed-text are specifically optimized for vectorization and are much faster.
  2. Persistent Vector Databases: Replace the in-memory list with Qdrant. Running Qdrant is as simple as: docker run -p 6333:6333 qdrant/qdrant. This lets you scale to millions of documents while keeping search latency in the low milliseconds.
  3. Chunking Strategy: Don't embed entire books. Break documents into 500-token chunks with a 50-token overlap so the context remains coherent across fragments (see the sketch below).
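
A minimal chunking sketch for the last point, assuming a rough heuristic of four characters per token instead of a real tokenizer (so 500 tokens ≈ 2,000 characters and 50 tokens ≈ 200 characters):

// Splits a document into overlapping character-based chunks before embedding.
static IEnumerable<string> ChunkText(string text, int chunkChars = 2000, int overlapChars = 200)
{
    for (int start = 0; start < text.Length; start += chunkChars - overlapChars)
    {
        // The last chunk may be shorter than chunkChars
        yield return text.Substring(start, Math.Min(chunkChars, text.Length - start));
    }
}

Each chunk is then embedded and stored individually in ImportDocumentsAsync instead of the whole file.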

Conclusion

Building a local RAG system with .NET 9 and Semantic Kernel proves that you don't need a massive cloud budget to implement sophisticated AI. By keeping your data local, you eliminate latency jitter and data privacy concerns. As your needs grow, you can seamlessly transition to more powerful architectures or integrate managed services from n1n.ai to handle production loads.

Get a free API key at n1n.ai