Summary: This post explores how to implement Retrieval-Augmented Generation (RAG) applications using Azure Cognitive Search and .NET. Learn how to create a powerful document search and Q&A system that combines the knowledge from your own data with the capabilities of large language models.
Introduction
As large language models (LLMs) like GPT-4 become increasingly powerful, developers are finding innovative ways to enhance these models with domain-specific knowledge. One of the most effective approaches is Retrieval-Augmented Generation (RAG), which combines the power of search technology with generative AI.
RAG addresses several key limitations of pure LLM approaches:
- Knowledge cutoff: LLMs only know information up to their training cutoff date
- Hallucinations: LLMs can sometimes generate plausible-sounding but incorrect information
- Domain-specific knowledge: Organizations need AI systems that understand their proprietary data
- Data freshness: Information changes rapidly, and models need access to the latest data
In this post, we’ll explore how to build a RAG application using Azure Cognitive Search (recently enhanced with vector search capabilities) and .NET. We’ll create a system that can ingest documents, index them for efficient retrieval, and use Azure OpenAI Service to generate accurate, contextually relevant responses based on your data.
Understanding RAG Architecture
Before diving into the implementation, let’s understand the key components of a RAG system (a minimal code sketch of the core components follows this list):
1. Document Processing Pipeline
This component handles:
- Document ingestion from various sources
- Text extraction and chunking
- Embedding generation
- Indexing in a vector database or search service
2. Retrieval System
This component:
- Processes user queries
- Converts queries to vector embeddings
- Performs semantic search to find relevant documents
- Ranks and filters results
3. Generation System
This component:
- Constructs prompts using retrieved documents as context
- Sends prompts to an LLM
- Processes and formats responses
- Optionally cites sources
4. User Interface
This component:
- Accepts user queries
- Displays responses
- Provides citation information
- Offers feedback mechanisms
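To make these responsibilities concrete, here is a minimal sketch of the first three components expressed as .NET interfaces. The names and signatures are purely illustrative (they are not part of any Azure SDK); the concrete services built later in this post play these roles without literally implementing these interfaces.
csharp
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

// Illustrative contracts only; the concrete services later in this post fill these roles.
public interface IDocumentProcessingPipeline
{
    // Ingest, chunk, embed, and index a single document.
    Task ProcessDocumentAsync(Stream document, string documentName);
}

public interface IRetrievalSystem
{
    // Embed the query and return the most relevant text chunks.
    Task<IReadOnlyList<string>> RetrieveAsync(string query, int topResults = 5);
}

public interface IGenerationSystem
{
    // Build a grounded prompt from the retrieved chunks and ask the LLM for an answer.
    Task<string> GenerateAnswerAsync(string question, IReadOnlyList<string> context);
}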
Setting Up Your Development Environment
Let’s start by setting up a .NET project with the necessary dependencies:
bash
dotnet new webapi -n RagApplication
cd RagApplication
dotnet add package Azure.Search.Documents
dotnet add package Azure.AI.OpenAI
dotnet add package Azure.Storage.Blobs
dotnet add package PdfPig
These packages provide:
- Azure Cognitive Search client
- Azure OpenAI client
- Azure Blob Storage client (for document storage)
- PdfPig for PDF text extraction
Document Processing Pipeline
The first step in building our RAG application is creating a document processing pipeline that can ingest, process, and index documents.
Document Ingestion
Let’s create a service to upload documents to Azure Blob Storage:
csharp
using Azure.Storage.Blobs;
using Microsoft.AspNetCore.Http;
using System;
using System.IO;
using System.Threading.Tasks;
public class DocumentStorageService
{
private readonly BlobServiceClient _blobServiceClient;
private readonly string _containerName;
public DocumentStorageService(string connectionString, string containerName)
{
_blobServiceClient = new BlobServiceClient(connectionString);
_containerName = containerName;
// Ensure container exists
var containerClient = _blobServiceClient.GetBlobContainerClient(containerName);
containerClient.CreateIfNotExists();
}
public async Task<string> UploadDocumentAsync(IFormFile file)
{
// Generate a unique ID for the document
string documentId = Guid.NewGuid().ToString();
// Get the file extension
string extension = Path.GetExtension(file.FileName);
// Create blob name with original filename and unique ID
string blobName = $"{Path.GetFileNameWithoutExtension(file.FileName)}-{documentId}{extension}";
// Get container client
var containerClient = _blobServiceClient.GetBlobContainerClient(_containerName);
// Get blob client
var blobClient = containerClient.GetBlobClient(blobName);
// Upload the file
using (var stream = file.OpenReadStream())
{
await blobClient.UploadAsync(stream, true);
}
// Return the blob name (which we'll use as a reference)
return blobName;
}
public async Task<Stream> DownloadDocumentAsync(string blobName)
{
var containerClient = _blobServiceClient.GetBlobContainerClient(_containerName);
var blobClient = containerClient.GetBlobClient(blobName);
var memoryStream = new MemoryStream();
await blobClient.DownloadToAsync(memoryStream);
memoryStream.Position = 0;
return memoryStream;
}
}
Text Extraction
Next, we need to extract text from various document formats. Let’s start with PDF processing:
csharp
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;
using System;
using System.IO;
using System.Text;
using System.Collections.Generic;
public class TextExtractionService
{
public string ExtractTextFromPdf(Stream pdfStream)
{
var textBuilder = new StringBuilder();
using (var pdf = PdfDocument.Open(pdfStream))
{
for (var i = 1; i <= pdf.NumberOfPages; i++)
{
Page page = pdf.GetPage(i);
textBuilder.AppendLine(page.Text);
}
}
return textBuilder.ToString();
}
public List<TextChunk> ChunkText(string text, int maxChunkSize = 1000, int overlapSize = 100)
{
var chunks = new List<TextChunk>();
// Simple chunking by splitting on paragraphs and then combining
var paragraphs = text.Split(new[] { "\r\n\r\n", "\n\n" }, StringSplitOptions.RemoveEmptyEntries);
var currentChunk = new StringBuilder();
var currentChunkId = 0;
foreach (var paragraph in paragraphs)
{
// If adding this paragraph would exceed the max chunk size,
// finalize the current chunk and start a new one
if (currentChunk.Length + paragraph.Length > maxChunkSize && currentChunk.Length > 0)
{
chunks.Add(new TextChunk
{
Id = currentChunkId++.ToString(),
Text = currentChunk.ToString().Trim()
});
// Start new chunk with overlap from the end of the previous chunk
if (currentChunk.Length > overlapSize)
{
string overlapText = currentChunk.ToString().Substring(
Math.Max(0, currentChunk.Length - overlapSize));
currentChunk = new StringBuilder(overlapText);
}
else
{
currentChunk = new StringBuilder();
}
}
// Add the paragraph to the current chunk
if (currentChunk.Length > 0)
{
currentChunk.AppendLine();
}
currentChunk.AppendLine(paragraph.Trim());
}
// Add the final chunk if it's not empty
if (currentChunk.Length > 0)
{
chunks.Add(new TextChunk
{
Id = currentChunkId.ToString(),
Text = currentChunk.ToString().Trim()
});
}
return chunks;
}
}
public class TextChunk
{
public string Id { get; set; }
public string Text { get; set; }
public string DocumentId { get; set; }
public string DocumentName { get; set; }
public int PageNumber { get; set; }
}
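As a quick sanity check, the extractor and chunker can be exercised against a local file. This is only a sketch: the file name is a placeholder, and the snippet assumes a console or top-level-statement context.
csharp
// Quick local check of extraction and chunking; "sample.pdf" is a placeholder path.
var textExtraction = new TextExtractionService();
using (var pdfStream = File.OpenRead("sample.pdf"))
{
    string extractedText = textExtraction.ExtractTextFromPdf(pdfStream);
    var chunks = textExtraction.ChunkText(extractedText, maxChunkSize: 1000, overlapSize: 100);
    Console.WriteLine($"Produced {chunks.Count} chunks of roughly 1,000 characters each.");
}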
Embedding Generation
Now, let’s create a service to generate embeddings using Azure OpenAI:
csharp
using Azure;
using Azure.AI.OpenAI;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
public class EmbeddingService
{
private readonly OpenAIClient _client;
private readonly string _deploymentName;
public EmbeddingService(string endpoint, string apiKey, string deploymentName)
{
_client = new OpenAIClient(new Uri(endpoint), new AzureKeyCredential(apiKey));
_deploymentName = deploymentName;
}
public async Task<float[]> GenerateEmbeddingAsync(string text)
{
var response = await _client.GetEmbeddingsAsync(new EmbeddingsOptions(_deploymentName, new List<string> { text }));
return response.Value.Data[0].Embedding.ToArray();
}
public async Task<List<float[]>> GenerateEmbeddingsAsync(List<string> texts)
{
var embeddings = new List<float[]>();
// Process in batches to avoid rate limits
int batchSize = 20;
for (int i = 0; i < texts.Count; i += batchSize)
{
var batch = texts.Skip(i).Take(batchSize).ToList();
var response = await _client.GetEmbeddingsAsync(new EmbeddingsOptions(_deploymentName, batch));
foreach (var embedding in response.Value.Data)
{
embeddings.Add(embedding.Embedding.ToArray());
}
}
return embeddings;
}
}
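For example, a single query embedding might be generated like this. The endpoint, key, and deployment name are placeholders for your own Azure OpenAI resource; the sketch assumes an embedding deployment such as text-embedding-ada-002, which returns 1536-dimensional vectors (matching the index definition in the next section).
csharp
// Placeholder endpoint, API key, and deployment name; substitute your own resource values.
var embeddingService = new EmbeddingService(
    "https://your-openai-resource.openai.azure.com/",
    "<your-api-key>",
    "text-embedding-ada-002");

float[] queryVector = await embeddingService.GenerateEmbeddingAsync("What is our refund policy?");
Console.WriteLine($"Embedding dimensions: {queryVector.Length}"); // 1536 for text-embedding-ada-002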
Document Indexing
Now, let’s create a service to index documents in Azure Cognitive Search:
csharp
using Azure;
using Azure.Search.Documents;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;
using Azure.Search.Documents.Models;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json.Serialization;
using System.Threading.Tasks;
public class SearchIndexingService
{
private readonly SearchIndexClient _indexClient;
private readonly string _indexName;
public SearchIndexingService(string searchServiceEndpoint, string adminApiKey, string indexName)
{
_indexClient = new SearchIndexClient(
new Uri(searchServiceEndpoint),
new AzureKeyCredential(adminApiKey));
_indexName = indexName;
}
public async Task CreateIndexIfNotExistsAsync()
{
// Check whether the index already exists without pulling in System.Linq.Async
bool indexExists = false;
await foreach (var existingIndexName in _indexClient.GetIndexNamesAsync())
{
if (existingIndexName == _indexName)
{
indexExists = true;
break;
}
}
if (indexExists)
{
return;
}
var fields = new List<SearchField>
{
new SearchField("id", SearchFieldDataType.String) { IsKey = true },
new SearchField("documentId", SearchFieldDataType.String) { IsFilterable = true },
new SearchField("documentName", SearchFieldDataType.String) { IsFilterable = true, IsSearchable = true },
new SearchField("pageNumber", SearchFieldDataType.Int32) { IsFilterable = true },
new SearchField("text", SearchFieldDataType.String) { IsSearchable = true },
new SearchField("embedding", SearchFieldDataType.Collection(SearchFieldDataType.Single))
{
IsSearchable = true,
// 1536 dimensions matches the text-embedding-ada-002 embedding model
VectorSearchDimensions = 1536,
VectorSearchConfiguration = "my-algorithm"
}
};
// Note: this targets the vector search preview of Azure.Search.Documents (11.5.0-beta).
// The GA SDK renames these types (HnswAlgorithmConfiguration, VectorSearchProfile,
// VectorSearchProfileName), so adjust to match the SDK version you install.
var index = new SearchIndex(_indexName, fields)
{
VectorSearch = new VectorSearch
{
AlgorithmConfigurations =
{
new HnswVectorSearchAlgorithmConfiguration("my-algorithm")
}
}
};
await _indexClient.CreateOrUpdateIndexAsync(index);
}
public async Task IndexDocumentChunksAsync(List<DocumentChunk> chunks)
{
var searchClient = _indexClient.GetSearchClient(_indexName);
// Process in batches
int batchSize = 100;
for (int i = 0; i < chunks.Count; i += batchSize)
{
var batch = chunks.Skip(i).Take(batchSize).ToList();
await searchClient.IndexDocumentsAsync(IndexDocumentsBatch.Upload(batch));
}
}
}
public class DocumentChunk
{
// JsonPropertyName keeps the serialized property names in sync with the
// lowercase field names defined in the search index
[JsonPropertyName("id")]
public string Id { get; set; }
[JsonPropertyName("documentId")]
public string DocumentId { get; set; }
[JsonPropertyName("documentName")]
public string DocumentName { get; set; }
[JsonPropertyName("pageNumber")]
public int PageNumber { get; set; }
[JsonPropertyName("text")]
public string Text { get; set; }
[JsonPropertyName("embedding")]
public float[] Embedding { get; set; }
}
Putting It All Together: Document Processing Pipeline
Now, let’s create a service that orchestrates the entire document processing pipeline:
csharp
using Microsoft.AspNetCore.Http;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
public class DocumentProcessingService
{
private readonly DocumentStorageService _storageService;
private readonly TextExtractionService _textExtractionService;
private readonly EmbeddingService _embeddingService;
private readonly SearchIndexingService _searchIndexingService;
public DocumentProcessingService(
DocumentStorageService storageService,
TextExtractionService textExtractionService,
EmbeddingService embeddingService,
SearchIndexingService searchIndexingService)
{
_storageService = storageService;
_textExtractionService = textExtractionService;
_embeddingService = embeddingService;
_searchIndexingService = searchIndexingService;
}
public async Task ProcessDocumentAsync(IFormFile file)
{
// Step 1: Upload document to blob storage
string blobName = await _storageService.UploadDocumentAsync(file);
// Step 2: Download document for processing
using var documentStream = await _storageService.DownloadDocumentAsync(blobName);
// Step 3: Extract text from document
string text = _textExtractionService.ExtractTextFromPdf(documentStream);
// Step 4: Chunk the text
var textChunks = _textExtractionService.ChunkText(text);
// Set document metadata for each chunk
foreach (var chunk in textChunks)
{
// Prefix the per-document chunk index with a GUID so index keys stay unique across documents
chunk.Id = $"{Guid.NewGuid():N}-{chunk.Id}";
chunk.DocumentId = blobName;
chunk.DocumentName = file.FileName;
// In a real application, you would extract page numbers from the PDF
chunk.PageNumber = 1;
}
// Step 5: Generate embeddings for each chunk
var texts = textChunks.Select(c => c.Text).ToList();
var embeddings = await _embeddingService.GenerateEmbeddingsAsync(texts);
// Step 6: Create document chunks for indexing
var documentChunks = new List<DocumentChunk>();
for (int i = 0; i < textChunks.Count; i++)
{
documentChunks.Add(new DocumentChunk
{
Id = textChunks[i].Id,
DocumentId = textChunks[i].DocumentId,
DocumentName = textChunks[i].DocumentName,
PageNumber = textChunks[i].PageNumber,
Text = textChunks[i].Text,
Embedding = embeddings[i]
});
}
// Step 7: Ensure search index exists
await _searchIndexingService.CreateIndexIfNotExistsAsync();
// Step 8: Index the document chunks
await _searchIndexingService.IndexDocumentChunksAsync(documentChunks);
}
}
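To try the pipeline end to end, the services still need to be registered and exposed through an HTTP endpoint. Here is one possible wiring; the configuration keys, container name, index name, and controller are illustrative placeholders rather than fixed names from the earlier code.
csharp
// Program.cs (sketch): added alongside the registrations the webapi template already contains.
// The configuration keys, container name, and index name are placeholders; use your own.
builder.Services.AddSingleton(new DocumentStorageService(
    builder.Configuration["AzureStorage:ConnectionString"],
    "documents"));
builder.Services.AddSingleton<TextExtractionService>();
builder.Services.AddSingleton(new EmbeddingService(
    builder.Configuration["AzureOpenAI:Endpoint"],
    builder.Configuration["AzureOpenAI:ApiKey"],
    builder.Configuration["AzureOpenAI:EmbeddingDeployment"]));
builder.Services.AddSingleton(new SearchIndexingService(
    builder.Configuration["AzureSearch:Endpoint"],
    builder.Configuration["AzureSearch:AdminApiKey"],
    "rag-chunks"));
builder.Services.AddSingleton<DocumentProcessingService>();

// DocumentsController.cs (sketch): a minimal upload endpoint that drives the pipeline.
[ApiController]
[Route("api/[controller]")]
public class DocumentsController : ControllerBase
{
    private readonly DocumentProcessingService _processingService;

    public DocumentsController(DocumentProcessingService processingService)
    {
        _processingService = processingService;
    }

    [HttpPost]
    public async Task<IActionResult> Upload(IFormFile file)
    {
        if (file == null || file.Length == 0)
        {
            return BadRequest("No file was uploaded.");
        }

        await _processingService.ProcessDocumentAsync(file);
        return Ok(new { file.FileName, status = "indexed" });
    }
}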