Building RAG Applications with Azure Cognitive Search and .NET

Summary: This post explores how to implement Retrieval-Augmented Generation (RAG) applications using Azure Cognitive Search and .NET. Learn how to create a powerful document search and Q&A system that combines the knowledge from your own data with the capabilities of large language models.

Introduction

As large language models (LLMs) like GPT-4 become increasingly powerful, developers are finding innovative ways to enhance these models with domain-specific knowledge. One of the most effective approaches is Retrieval-Augmented Generation (RAG), which combines the power of search technology with generative AI.

RAG addresses several key limitations of pure LLM approaches:

  1. Knowledge cutoff: LLMs only know information up to their training cutoff date
  2. Hallucinations: LLMs can sometimes generate plausible-sounding but incorrect information
  3. Domain-specific knowledge: Organizations need AI systems that understand their proprietary data
  4. Data freshness: Information changes rapidly, and models need access to the latest data

In this post, we’ll explore how to build a RAG application using Azure Cognitive Search (recently enhanced with vector search capabilities) and .NET. We’ll create a system that can ingest documents, index them for efficient retrieval, and use Azure OpenAI Service to generate accurate, contextually relevant responses based on your data.

Understanding RAG Architecture

Before diving into the implementation, let's understand the key components of a RAG system; a minimal sketch of how they fit together at query time follows the component descriptions:

1. Document Processing Pipeline

This component handles:

  • Document ingestion from various sources
  • Text extraction and chunking
  • Embedding generation
  • Indexing in a vector database or search service

2. Retrieval System

This component:

  • Processes user queries
  • Converts queries to vector embeddings
  • Performs semantic search to find relevant documents
  • Ranks and filters results

3. Generation System

This component:

  • Constructs prompts using retrieved documents as context
  • Sends prompts to an LLM
  • Processes and formats responses
  • Optionally cites sources

4. User Interface

This component:

  • Accepts user queries
  • Displays responses
  • Provides citation information
  • Offers feedback mechanisms
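
To make the flow concrete, here's a minimal sketch of how these components interact at query time. The three interfaces are placeholders standing in for the concrete services we build in this post, not types from any SDK:

csharp

using System.Collections.Generic;
using System.Threading.Tasks;

// Placeholder abstractions for the three moving parts of a RAG query.
public interface IQueryEmbedder { Task<float[]> EmbedAsync(string text); }
public interface IChunkRetriever { Task<IReadOnlyList<string>> SearchAsync(float[] vector, int top); }
public interface IAnswerGenerator { Task<string> CompleteAsync(string prompt); }

public class RagPipeline
{
    private readonly IQueryEmbedder _embedder;
    private readonly IChunkRetriever _retriever;
    private readonly IAnswerGenerator _generator;

    public RagPipeline(IQueryEmbedder embedder, IChunkRetriever retriever, IAnswerGenerator generator)
    {
        _embedder = embedder;
        _retriever = retriever;
        _generator = generator;
    }

    public async Task<string> AnswerAsync(string question)
    {
        // Retrieval: embed the query and fetch the most relevant chunks
        float[] queryVector = await _embedder.EmbedAsync(question);
        var context = await _retriever.SearchAsync(queryVector, top: 3);

        // Generation: ground the LLM's answer in the retrieved context
        string prompt = "Answer the question using only the context below.\n\n"
            + "Context:\n" + string.Join("\n---\n", context)
            + "\n\nQuestion: " + question;

        return await _generator.CompleteAsync(prompt);
    }
}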

Setting Up Your Development Environment

Let’s start by setting up a .NET project with the necessary dependencies:

bash

dotnet new webapi -n RagApplication
cd RagApplication
dotnet add package Azure.Search.Documents
dotnet add package Azure.AI.OpenAI
dotnet add package Azure.Storage.Blobs
dotnet add package PdfPig

These packages provide:

  • Azure Cognitive Search client
  • Azure OpenAI client
  • Azure Blob Storage client (for document storage)
  • PdfPig for PDF text extraction
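
The rest of this post defines each of these services. As a preview of how they fit together in the web app, here's a sketch of registering them in Program.cs; the configuration key names are illustrative assumptions, not fixed conventions:

csharp

// Program.cs (sketch): wires up the services built in the sections below.
// Configuration key names here are illustrative.
var builder = WebApplication.CreateBuilder(args);
var config = builder.Configuration;

builder.Services.AddSingleton(new DocumentStorageService(
    config["Storage:ConnectionString"], config["Storage:ContainerName"]));
builder.Services.AddSingleton<TextExtractionService>();
builder.Services.AddSingleton(new EmbeddingService(
    config["AzureOpenAI:Endpoint"], config["AzureOpenAI:ApiKey"], config["AzureOpenAI:EmbeddingDeployment"]));
builder.Services.AddSingleton(new SearchIndexingService(
    config["AzureSearch:Endpoint"], config["AzureSearch:AdminKey"], config["AzureSearch:IndexName"]));
builder.Services.AddSingleton<DocumentProcessingService>();

builder.Services.AddControllers();

var app = builder.Build();
app.MapControllers();
app.Run();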

Document Processing Pipeline

The first step in building our RAG application is creating a document processing pipeline that can ingest, process, and index documents.

Document Ingestion

Let’s create a service to upload documents to Azure Blob Storage:

csharp

using Azure.Storage.Blobs;
using Microsoft.AspNetCore.Http;
using System;
using System.IO;
using System.Threading.Tasks;

public class DocumentStorageService
{
    private readonly BlobServiceClient _blobServiceClient;
    private readonly string _containerName;

    public DocumentStorageService(string connectionString, string containerName)
    {
        _blobServiceClient = new BlobServiceClient(connectionString);
        _containerName = containerName;
        
        // Ensure container exists
        var containerClient = _blobServiceClient.GetBlobContainerClient(containerName);
        containerClient.CreateIfNotExists();
    }

    public async Task<string> UploadDocumentAsync(IFormFile file)
    {
        // Generate a unique ID for the document
        string documentId = Guid.NewGuid().ToString();
        
        // Get the file extension
        string extension = Path.GetExtension(file.FileName);
        
        // Create blob name with original filename and unique ID
        string blobName = $"{Path.GetFileNameWithoutExtension(file.FileName)}-{documentId}{extension}";
        
        // Get container client
        var containerClient = _blobServiceClient.GetBlobContainerClient(_containerName);
        
        // Get blob client
        var blobClient = containerClient.GetBlobClient(blobName);
        
        // Upload the file
        using (var stream = file.OpenReadStream())
        {
            await blobClient.UploadAsync(stream, true);
        }
        
        // Return the blob name (which we'll use as a reference)
        return blobName;
    }

    public async Task<Stream> DownloadDocumentAsync(string blobName)
    {
        var containerClient = _blobServiceClient.GetBlobContainerClient(_containerName);
        var blobClient = containerClient.GetBlobClient(blobName);
        
        var memoryStream = new MemoryStream();
        await blobClient.DownloadToAsync(memoryStream);
        memoryStream.Position = 0;
        
        return memoryStream;
    }
}
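
A quick usage sketch, assuming the service has been resolved from DI and file is the IFormFile from an upload request:

csharp

// Hypothetical usage of the storage service
string blobName = await storageService.UploadDocumentAsync(file);

// Later, fetch the stored document back for processing
using Stream contents = await storageService.DownloadDocumentAsync(blobName);

Note that DownloadDocumentAsync buffers the entire blob into a MemoryStream. That keeps the walkthrough simple, but for very large documents you would process the blob stream directly instead.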

Text Extraction

Next, we need to extract text from various document formats. Let’s start with PDF processing:

csharp

using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public class TextExtractionService
{
    public string ExtractTextFromPdf(Stream pdfStream)
    {
        var textBuilder = new StringBuilder();
        
        using (var pdf = PdfDocument.Open(pdfStream))
        {
            for (var i = 1; i <= pdf.NumberOfPages; i++)
            {
                Page page = pdf.GetPage(i);
                textBuilder.AppendLine(page.Text);
            }
        }
        
        return textBuilder.ToString();
    }
    
    public List<TextChunk> ChunkText(string text, int maxChunkSize = 1000, int overlapSize = 100)
    {
        var chunks = new List<TextChunk>();
        
        // Simple chunking by splitting on paragraphs and then combining
        var paragraphs = text.Split(new[] { "\r\n\r\n", "\n\n" }, StringSplitOptions.RemoveEmptyEntries);
        
        var currentChunk = new StringBuilder();
        var currentChunkId = 0;
        
        foreach (var paragraph in paragraphs)
        {
            // If adding this paragraph would exceed the max chunk size, 
            // finalize the current chunk and start a new one
            if (currentChunk.Length + paragraph.Length > maxChunkSize && currentChunk.Length > 0)
            {
                chunks.Add(new TextChunk
                {
                    Id = currentChunkId++.ToString(),
                    Text = currentChunk.ToString().Trim()
                });
                
                // Start new chunk with overlap from the end of the previous chunk
                if (currentChunk.Length > overlapSize)
                {
                    string overlapText = currentChunk.ToString().Substring(
                        Math.Max(0, currentChunk.Length - overlapSize));
                    currentChunk = new StringBuilder(overlapText);
                }
                else
                {
                    currentChunk = new StringBuilder();
                }
            }
            
            // Add the paragraph to the current chunk
            if (currentChunk.Length > 0)
            {
                currentChunk.AppendLine();
            }
            currentChunk.AppendLine(paragraph.Trim());
        }
        
        // Add the final chunk if it's not empty
        if (currentChunk.Length > 0)
        {
            chunks.Add(new TextChunk
            {
                Id = currentChunkId.ToString(),
                Text = currentChunk.ToString().Trim()
            });
        }
        
        return chunks;
    }
}

public class TextChunk
{
    public string Id { get; set; }
    public string Text { get; set; }
    public string DocumentId { get; set; }
    public string DocumentName { get; set; }
    public int PageNumber { get; set; }
}
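
To see the chunking behavior concretely, here's a small illustrative example; the sample text and the deliberately tiny sizes are made up for demonstration:

csharp

using System;

var extractor = new TextExtractionService();

string sample = "First paragraph about setup.\n\n" +
                "Second paragraph about indexing.\n\n" +
                "Third paragraph about querying.";

// With maxChunkSize this small, each paragraph ends up in its own chunk,
// and each new chunk begins with the tail of the previous one.
var chunks = extractor.ChunkText(sample, maxChunkSize: 60, overlapSize: 20);

foreach (var chunk in chunks)
{
    Console.WriteLine($"Chunk {chunk.Id}:\n{chunk.Text}\n");
}

The overlap matters because a fact that straddles a chunk boundary would otherwise be split across two chunks, and the retriever might never surface it intact.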

Embedding Generation

Now, let’s create a service to generate embeddings using Azure OpenAI:

csharp

using Azure;
using Azure.AI.OpenAI;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public class EmbeddingService
{
    private readonly OpenAIClient _client;
    private readonly string _deploymentName;

    public EmbeddingService(string endpoint, string apiKey, string deploymentName)
    {
        _client = new OpenAIClient(new Uri(endpoint), new AzureKeyCredential(apiKey));
        _deploymentName = deploymentName;
    }

    public async Task<float[]> GenerateEmbeddingAsync(string text)
    {
        var response = await _client.GetEmbeddingsAsync(new EmbeddingsOptions(_deploymentName, new List<string> { text }));
        return response.Value.Data[0].Embedding.ToArray();
    }

    public async Task<List<float[]>> GenerateEmbeddingsAsync(List<string> texts)
    {
        var embeddings = new List<float[]>();
        
        // Process in batches: Azure OpenAI caps the number of inputs per
        // embeddings request (historically 16 for text-embedding-ada-002)
        int batchSize = 16;
        for (int i = 0; i < texts.Count; i += batchSize)
        {
            var batch = texts.Skip(i).Take(batchSize).ToList();
            var response = await _client.GetEmbeddingsAsync(new EmbeddingsOptions(_deploymentName, batch));
            
            foreach (var embedding in response.Value.Data)
            {
                embeddings.Add(embedding.Embedding.ToArray());
            }
        }
        
        return embeddings;
    }
}
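
Using the service is straightforward once it's constructed. The endpoint, key, and deployment name below are placeholders for your own Azure OpenAI resource; a text-embedding-ada-002 deployment produces the 1536-dimensional vectors that the index in the next section assumes:

csharp

using System;

// Placeholder endpoint, key, and deployment name - substitute your own
var embeddingService = new EmbeddingService(
    "https://your-resource.openai.azure.com/",
    "your-api-key",
    "your-embedding-deployment");

float[] vector = await embeddingService.GenerateEmbeddingAsync("What is RAG?");
Console.WriteLine(vector.Length); // 1536 for text-embedding-ada-002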

Document Indexing

Now, let’s create a service to index documents in Azure Cognitive Search:

csharp

using Azure;
using Azure.Search.Documents;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;
using Azure.Search.Documents.Models;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json.Serialization;
using System.Threading.Tasks;

public class SearchIndexingService
{
    private readonly SearchIndexClient _indexClient;
    private readonly string _indexName;

    public SearchIndexingService(string searchServiceEndpoint, string adminApiKey, string indexName)
    {
        _indexClient = new SearchIndexClient(
            new Uri(searchServiceEndpoint),
            new AzureKeyCredential(adminApiKey));
        _indexName = indexName;
    }

    public async Task CreateIndexIfNotExistsAsync()
    {
        // Check for the index by fetching it; a 404 means it doesn't exist yet
        try
        {
            await _indexClient.GetIndexAsync(_indexName);
            return;
        }
        catch (RequestFailedException ex) when (ex.Status == 404)
        {
            // Index not found; create it below
        }

        var fields = new List<SearchField>
        {
            new SearchField("id", SearchFieldDataType.String) { IsKey = true },
            new SearchField("documentId", SearchFieldDataType.String) { IsFilterable = true },
            new SearchField("documentName", SearchFieldDataType.String) { IsFilterable = true, IsSearchable = true },
            new SearchField("pageNumber", SearchFieldDataType.Int32) { IsFilterable = true },
            new SearchField("text", SearchFieldDataType.String) { IsSearchable = true },
            // Vector field: 1536 dimensions matches text-embedding-ada-002
            new SearchField("embedding", SearchFieldDataType.Collection(SearchFieldDataType.Single))
            {
                IsSearchable = true,
                VectorSearchDimensions = 1536,
                VectorSearchProfileName = "my-profile"
            }
        };

        // Vector search configuration (Azure.Search.Documents 11.5+ model):
        // a profile ties the vector field to an HNSW algorithm configuration
        var index = new SearchIndex(_indexName, fields)
        {
            VectorSearch = new VectorSearch
            {
                Profiles = { new VectorSearchProfile("my-profile", "my-algorithm") },
                Algorithms = { new HnswAlgorithmConfiguration("my-algorithm") }
            }
        };

        await _indexClient.CreateOrUpdateIndexAsync(index);
    }

    public async Task IndexDocumentChunksAsync(List<DocumentChunk> chunks)
    {
        var searchClient = _indexClient.GetSearchClient(_indexName);
        
        // Process in batches
        int batchSize = 100;
        for (int i = 0; i < chunks.Count; i += batchSize)
        {
            var batch = chunks.Skip(i).Take(batchSize).ToList();
            await searchClient.IndexDocumentsAsync(IndexDocumentsBatch.Upload(batch));
        }
    }
}

// Property names are mapped to the camelCase index field names via
// System.Text.Json attributes (the SDK serializes documents with
// System.Text.Json, so "Id" would not match the "id" field by default).
public class DocumentChunk
{
    [JsonPropertyName("id")]
    public string Id { get; set; }
    [JsonPropertyName("documentId")]
    public string DocumentId { get; set; }
    [JsonPropertyName("documentName")]
    public string DocumentName { get; set; }
    [JsonPropertyName("pageNumber")]
    public int PageNumber { get; set; }
    [JsonPropertyName("text")]
    public string Text { get; set; }
    [JsonPropertyName("embedding")]
    public float[] Embedding { get; set; }
}

Putting It All Together: Document Processing Pipeline

Now, let’s create a service that orchestrates the entire document processing pipeline:

csharp

using Microsoft.AspNetCore.Http;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public class DocumentProcessingService
{
    private readonly DocumentStorageService _storageService;
    private readonly TextExtractionService _textExtractionService;
    private readonly EmbeddingService _embeddingService;
    private readonly SearchIndexingService _searchIndexingService;

    public DocumentProcessingService(
        DocumentStorageService storageService,
        TextExtractionService textExtractionService,
        EmbeddingService embeddingService,
        SearchIndexingService searchIndexingService)
    {
        _storageService = storageService;
        _textExtractionService = textExtractionService;
        _embeddingService = embeddingService;
        _searchIndexingService = searchIndexingService;
    }

    public async Task ProcessDocumentAsync(IFormFile file)
    {
        // Step 1: Upload document to blob storage
        string blobName = await _storageService.UploadDocumentAsync(file);
        
        // Step 2: Download document for processing
        using var documentStream = await _storageService.DownloadDocumentAsync(blobName);
        
        // Step 3: Extract text from document
        string text = _textExtractionService.ExtractTextFromPdf(documentStream);
        
        // Step 4: Chunk the text
        var textChunks = _textExtractionService.ChunkText(text);
        
        // Set document metadata for each chunk. The per-document prefix makes
        // chunk IDs unique across documents; otherwise every document's chunks
        // would share the keys "0", "1", ... and overwrite each other in the index.
        string documentKey = Guid.NewGuid().ToString("N");
        foreach (var chunk in textChunks)
        {
            chunk.Id = $"{documentKey}-{chunk.Id}";
            chunk.DocumentId = blobName;
            chunk.DocumentName = file.FileName;
            // In a real application, you would extract page numbers from the PDF
            chunk.PageNumber = 1;
        }
        
        // Step 5: Generate embeddings for each chunk
        var texts = textChunks.Select(c => c.Text).ToList();
        var embeddings = await _embeddingService.GenerateEmbeddingsAsync(texts);
        
        // Step 6: Create document chunks for indexing
        var documentChunks = new List<DocumentChunk>();
        for (int i = 0; i < textChunks.Count; i++)
        {
            documentChunks.Add(new DocumentChunk
            {
                Id = textChunks[i].Id,
                DocumentId = textChunks[i].DocumentId,
                DocumentName = textChunks[i].DocumentName,
                PageNumber = textChunks[i].PageNumber,
                Text = textChunks[i].Text,
                Embedding = embeddings[i]
            });
        }
        
        // Step 7: Ensure search index exists
        await _searchIndexingService.CreateIndexIfNotExistsAsync();
        
        // Step 8: Index the document chunks
        await _searchIndexingService.IndexDocumentChunksAsync(documentChunks);
    }
}
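
Finally, to expose the pipeline over HTTP, a minimal upload endpoint might look like the following sketch (the controller name and route are illustrative):

csharp

using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using System.Threading.Tasks;

[ApiController]
[Route("api/documents")]
public class DocumentsController : ControllerBase
{
    private readonly DocumentProcessingService _processingService;

    public DocumentsController(DocumentProcessingService processingService)
    {
        _processingService = processingService;
    }

    [HttpPost("upload")]
    public async Task<IActionResult> Upload(IFormFile file)
    {
        if (file == null || file.Length == 0)
        {
            return BadRequest("No file provided.");
        }

        await _processingService.ProcessDocumentAsync(file);
        return Ok(new { message = "Document processed and indexed." });
    }
}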