Summary: This post explores performance optimization techniques for AI-powered .NET applications. Learn how to identify bottlenecks, implement caching strategies, optimize model loading and inference, and scale your applications to handle high loads while maintaining responsiveness.
Introduction
As AI capabilities become increasingly integrated into .NET applications, developers face new challenges in ensuring these applications perform well under real-world conditions. AI models, particularly large language models (LLMs), can introduce significant computational overhead that impacts application performance.
Performance optimization is critical for AI-powered applications because poor performance can lead to frustrated users, increased costs, and limited scalability. In this post, we’ll explore practical techniques for optimizing the performance of AI-powered .NET applications, with a focus on applications that integrate with services like Azure OpenAI, use local models like Llama 2 via Ollama, or leverage Microsoft Semantic Kernel.
We’ll cover strategies for identifying performance bottlenecks, implementing effective caching, optimizing model loading and inference, and scaling your applications to handle high loads. By applying these techniques, you can build AI-powered .NET applications that are both powerful and performant.
Understanding Performance Challenges in AI Applications
Before diving into optimization techniques, let’s understand the unique performance challenges that AI-powered applications face.
Computational Intensity
AI models, especially large language models, require significant computational resources. Operations like generating text, creating embeddings, or processing images can be CPU- and memory-intensive, potentially causing:
- Slow response times
- High resource utilization
- Increased costs (for cloud-based resources)
- Application timeouts
Network Latency
When using cloud-based AI services like Azure OpenAI Service, network latency becomes a significant factor:
- Round-trip time for API calls
- Bandwidth limitations for large requests/responses
- Network reliability issues
- API rate limits and throttling
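When latency and throttling are unavoidable, it helps to at least make them explicit. The sketch below, assuming the Azure.AI.OpenAI client used throughout this post together with the retry options from Azure.Core, tightens the client’s retry and network-timeout settings so slow or throttled calls back off predictably and fail fast instead of relying on defaults; the endpoint, key, and values shown are placeholders.

csharp
// A minimal sketch: configuring retry and timeout behaviour on the Azure OpenAI client.
// Assumes the Azure.AI.OpenAI package (OpenAIClient, OpenAIClientOptions) and Azure.Core (RetryMode).
var clientOptions = new OpenAIClientOptions
{
    Retry =
    {
        MaxRetries = 3,                           // retry transient failures and 429 throttling
        Mode = RetryMode.Exponential,             // back off exponentially between attempts
        NetworkTimeout = TimeSpan.FromSeconds(30) // fail fast instead of hanging on slow calls
    }
};

var openAIClient = new OpenAIClient(
    new Uri("https://your-resource.openai.azure.com/"),
    new AzureKeyCredential("your-api-key"),
    clientOptions);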
Memory Consumption
AI models can consume substantial amounts of memory:
- Large model weights (especially for local models)
- Batch processing of inputs
- Vector embeddings storage
- Context windows for LLMs
Scaling Challenges
AI workloads often don’t scale linearly:
- Concurrent requests may compete for resources
- Some operations can’t be easily parallelized
- Resource contention can cause performance degradation
- Cold starts can impact serverless deployments
Identifying Performance Bottlenecks
The first step in optimization is identifying where your application’s performance bottlenecks lie.
Profiling Tools
.NET provides several profiling tools to help identify bottlenecks:
- Visual Studio Profiler: Provides CPU, memory, and performance analysis
- dotTrace: JetBrains’ performance profiler for .NET applications
- Application Insights: Azure’s application performance management service
- PerfView: Microsoft’s performance analysis tool
Before reaching for a full profiler, start by instrumenting your critical paths with simple timing measurements; the results tell you which operations are worth profiling in depth:
csharp
// First, instrument your code with simple timing measurements
public async Task<string> GenerateTextAsync(string prompt)
{
var stopwatch = Stopwatch.StartNew();
try
{
var response = await _openAIClient.GetChatCompletionsAsync(
deploymentOrModelName: "gpt-4",
new ChatCompletionsOptions
{
Messages = { new ChatMessage(ChatRole.User, prompt) },
Temperature = 0.7f,
MaxTokens = 800
});
var result = response.Value.Choices[0].Message.Content;
return result;
}
finally
{
stopwatch.Stop();
_logger.LogInformation($"Text generation took {stopwatch.ElapsedMilliseconds}ms for prompt: {prompt.Substring(0, Math.Min(50, prompt.Length))}...");
}
}
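These timings tell you where to look; for a deeper view of call trees, allocations, and async hot paths, you can then attach the Visual Studio Profiler (Debug > Performance Profiler) or capture a trace with dotnet-trace against the running process.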
Custom Telemetry
Implement custom telemetry to track AI-specific metrics:
csharp
public class AiTelemetry
{
private readonly TelemetryClient _telemetryClient;
public AiTelemetry(TelemetryClient telemetryClient)
{
_telemetryClient = telemetryClient;
}
public void TrackModelInference(string modelName, string operationType, long durationMs, int inputTokens, int outputTokens)
{
var properties = new Dictionary<string, string>
{
["ModelName"] = modelName,
["OperationType"] = operationType
};
var metrics = new Dictionary<string, double>
{
["DurationMs"] = durationMs,
["InputTokens"] = inputTokens,
["OutputTokens"] = outputTokens,
["TotalTokens"] = inputTokens + outputTokens
};
_telemetryClient.TrackEvent("ModelInference", properties, metrics);
}
public void TrackApiCall(string apiName, long durationMs, bool success, string errorMessage = null)
{
var properties = new Dictionary<string, string>
{
["ApiName"] = apiName,
["Success"] = success.ToString(),
["ErrorMessage"] = errorMessage ?? string.Empty
};
var metrics = new Dictionary<string, double>
{
["DurationMs"] = durationMs
};
_telemetryClient.TrackEvent("ApiCall", properties, metrics);
}
}
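Here’s how this telemetry might be wired into the text-generation path from earlier. This is a sketch: it assumes the same OpenAIClient, an injected AiTelemetry instance (called _aiTelemetry here), and the usage object the Azure.AI.OpenAI SDK exposes for token counts.

csharp
public async Task<string> GenerateTextWithTelemetryAsync(string prompt)
{
    var stopwatch = Stopwatch.StartNew();

    var response = await _openAIClient.GetChatCompletionsAsync(
        deploymentOrModelName: "gpt-4",
        new ChatCompletionsOptions
        {
            Messages = { new ChatMessage(ChatRole.User, prompt) },
            Temperature = 0.7f,
            MaxTokens = 800
        });

    stopwatch.Stop();

    // Token counts come from the response's usage data
    var usage = response.Value.Usage;
    _aiTelemetry.TrackModelInference(
        modelName: "gpt-4",
        operationType: "ChatCompletion",
        durationMs: stopwatch.ElapsedMilliseconds,
        inputTokens: usage.PromptTokens,
        outputTokens: usage.CompletionTokens);

    return response.Value.Choices[0].Message.Content;
}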
Performance Benchmarking
Create benchmarks to measure the performance of critical operations:
csharp
// Install the BenchmarkDotNet package
// dotnet add package BenchmarkDotNet
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
[MemoryDiagnoser]
public class AiOperationsBenchmark
{
private readonly OpenAIClient _openAIClient;
private readonly EmbeddingService _embeddingService;
private readonly string _testPrompt;
public AiOperationsBenchmark()
{
_openAIClient = new OpenAIClient(
new Uri("https://your-resource.openai.azure.com/" ),
new AzureKeyCredential("your-api-key"));
_embeddingService = new EmbeddingService(_openAIClient);
_testPrompt = "Explain the concept of dependency injection in .NET";
}
[Benchmark]
public async Task<string> TextGeneration()
{
var response = await _openAIClient.GetChatCompletionsAsync(
deploymentOrModelName: "gpt-35-turbo",
new ChatCompletionsOptions
{
Messages = { new ChatMessage(ChatRole.User, _testPrompt) },
Temperature = 0.7f,
MaxTokens = 500
});
return response.Value.Choices[0].Message.Content;
}
[Benchmark]
public async Task<float[]> GenerateEmbedding()
{
return await _embeddingService.GenerateEmbeddingAsync(_testPrompt);
}
public static void Main(string[] args)
{
var summary = BenchmarkRunner.Run<AiOperationsBenchmark>();
}
}
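Two caveats: BenchmarkDotNet only produces meaningful numbers from a Release build (dotnet run -c Release), and because these benchmarks call a remote service, the results are dominated by network and service latency. They are most useful for comparing alternatives, such as prompt sizes, models, or caching on versus off, rather than for micro-optimizing local code.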
Implementing Caching Strategies
Caching is one of the most effective ways to improve the performance of AI-powered applications. Let’s explore different caching strategies.
Response Caching
Cache responses from AI models to avoid redundant API calls:
csharp
public class CachingOpenAIService : IOpenAIService
{
private readonly OpenAIClient _openAIClient;
private readonly IMemoryCache _cache;
private readonly TimeSpan _cacheDuration;
public CachingOpenAIService(
OpenAIClient openAIClient,
IMemoryCache cache,
TimeSpan? cacheDuration = null)
{
_openAIClient = openAIClient;
_cache = cache;
_cacheDuration = cacheDuration ?? TimeSpan.FromHours(1);
}
public async Task<string> GenerateTextAsync(string prompt, float temperature = 0.7f, int maxTokens = 800)
{
// Create a cache key based on the input parameters
string cacheKey = $"text_generation:{prompt}:{temperature}:{maxTokens}";
// Try to get the result from cache
if (_cache.TryGetValue(cacheKey, out string cachedResult))
{
return cachedResult;
}
// If not in cache, call the API
var response = await _openAIClient.GetChatCompletionsAsync(
deploymentOrModelName: "gpt-4",
new ChatCompletionsOptions
{
Messages = { new ChatMessage(ChatRole.User, prompt) },
Temperature = temperature,
MaxTokens = maxTokens
});
var result = response.Value.Choices[0].Message.Content;
// Cache the result
_cache.Set(cacheKey, result, _cacheDuration);
return result;
}
}
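To use this wrapper, register an in-memory cache and the service in your dependency injection container. A sketch, assuming an OpenAIClient is already registered and with an illustrative 30-minute cache duration:

csharp
// Program.cs / Startup.cs
services.AddMemoryCache();
services.AddSingleton<IOpenAIService>(sp => new CachingOpenAIService(
    sp.GetRequiredService<OpenAIClient>(),
    sp.GetRequiredService<IMemoryCache>(),
    TimeSpan.FromMinutes(30)));

Note that exact-match caching like this only helps when identical prompts recur; the semantic caching approach later in this post relaxes that requirement.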
Embedding Caching
Caching embeddings is particularly valuable since they’re deterministic and computationally expensive:
csharp
public class CachingEmbeddingService : IEmbeddingService
{
private readonly OpenAIClient _openAIClient;
private readonly IDistributedCache _cache;
private readonly TimeSpan _cacheDuration;
public CachingEmbeddingService(
OpenAIClient openAIClient,
IDistributedCache cache,
TimeSpan? cacheDuration = null)
{
_openAIClient = openAIClient;
_cache = cache;
_cacheDuration = cacheDuration ?? TimeSpan.FromDays(30); // Embeddings rarely change
}
public async Task<float[]> GenerateEmbeddingAsync(string text)
{
// Normalize the text to improve cache hits
string normalizedText = NormalizeText(text);
// Create a cache key
string cacheKey = $"embedding:{ComputeHash(normalizedText)}";
// Try to get from cache
byte[] cachedData = await _cache.GetAsync(cacheKey);
if (cachedData != null)
{
return DeserializeEmbedding(cachedData);
}
// If not in cache, generate the embedding
var response = await _openAIClient.GetEmbeddingsAsync(
deploymentOrModelName: "text-embedding-ada-002",
new EmbeddingsOptions(normalizedText));
var embedding = response.Value.Data[0].Embedding.ToArray();
// Cache the embedding
await _cache.SetAsync(
cacheKey,
SerializeEmbedding(embedding),
new DistributedCacheEntryOptions { AbsoluteExpirationRelativeToNow = _cacheDuration });
return embedding;
}
private string NormalizeText(string text)
{
// Simple normalization: trim whitespace and convert to lowercase
return text.Trim().ToLowerInvariant();
}
private string ComputeHash(string text)
{
using var sha256 = System.Security.Cryptography.SHA256.Create();
var bytes = System.Text.Encoding.UTF8.GetBytes(text);
var hash = sha256.ComputeHash(bytes);
return Convert.ToBase64String(hash);
}
private byte[] SerializeEmbedding(float[] embedding)
{
using var stream = new MemoryStream();
using var writer = new BinaryWriter(stream);
writer.Write(embedding.Length);
foreach (var value in embedding)
{
writer.Write(value);
}
return stream.ToArray();
}
private float[] DeserializeEmbedding(byte[] data)
{
using var stream = new MemoryStream(data);
using var reader = new BinaryReader(stream);
int length = reader.ReadInt32();
var embedding = new float[length];
for (int i = 0; i < length; i++)
{
embedding[i] = reader.ReadSingle();
}
return embedding;
}
}
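Serialized this way, a text-embedding-ada-002 vector (1,536 floats) takes roughly 6 KB per entry, so caching embeddings for even a large corpus is usually cheap compared with regenerating them, which is why a long expiration such as 30 days is reasonable.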
Distributed Caching with Redis
For applications that run across multiple instances, use a distributed cache like Redis:
csharp
// In Program.cs or Startup.cs
services.AddStackExchangeRedisCache(options =>
{
options.Configuration = Configuration.GetConnectionString("Redis");
options.InstanceName = "AiApp_";
});
services.AddSingleton<IEmbeddingService, CachingEmbeddingService>();
services.AddSingleton<IOpenAIService, CachingOpenAIService>();
Semantic Caching
Implement semantic caching to handle similar (but not identical) requests:
csharp
public class SemanticCachingService
{
private readonly IEmbeddingService _embeddingService;
private readonly IDistributedCache _cache;
private readonly float _similarityThreshold;
private readonly ConcurrentDictionary<string, (float[] Embedding, string CacheKey)> _embeddingCache
= new ConcurrentDictionary<string, (float[] Embedding, string CacheKey)>();
public SemanticCachingService(
IEmbeddingService embeddingService,
IDistributedCache cache,
float similarityThreshold = 0.95f)
{
_embeddingService = embeddingService;
_cache = cache;
_similarityThreshold = similarityThreshold;
}
public async Task<(bool Found, string Result)> TryGetFromCacheAsync(string query)
{
// Generate embedding for the query
var queryEmbedding = await _embeddingService.GenerateEmbeddingAsync(query);
// Find the most similar cached query
var mostSimilar = _embeddingCache
.Select(kv => (
CacheKey: kv.Value.CacheKey,
Similarity: CosineSimilarity(queryEmbedding, kv.Value.Embedding)
))
.Where(item => item.Similarity >= _similarityThreshold)
.OrderByDescending(item => item.Similarity)
.FirstOrDefault();
if (mostSimilar.CacheKey != null)
{
// Get the cached result
var cachedResult = await _cache.GetStringAsync(mostSimilar.CacheKey);
if (cachedResult != null)
{
return (true, cachedResult);
}
}
return (false, null);
}
public async Task CacheResultAsync(string query, string result, TimeSpan expiration)
{
// Generate embedding for the query
var queryEmbedding = await _embeddingService.GenerateEmbeddingAsync(query);
// Create a cache key
string cacheKey = $"semantic:{ComputeHash(query)}";
// Store the embedding in memory for similarity lookups
_embeddingCache[query] = (queryEmbedding, cacheKey);
// Cache the result
await _cache.SetStringAsync(
cacheKey,
result,
new DistributedCacheEntryOptions { AbsoluteExpirationRelativeToNow = expiration });
}
private float CosineSimilarity(float[] a, float[] b)
{
float dotProduct = 0;
float normA = 0;
float normB = 0;
for (int i = 0; i < a.Length; i++)
{
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dotProduct / (float)(Math.Sqrt(normA) * Math.Sqrt(normB));
}
private string ComputeHash(string text)
{
using var sha256 = System.Security.Cryptography.SHA256.Create();
var bytes = System.Text.Encoding.UTF8.GetBytes(text);
var hash = sha256.ComputeHash(bytes);
return Convert.ToBase64String(hash);
}
}
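Here is a sketch of how this fits around text generation; _openAIService is the IOpenAIService from the response-caching example and _semanticCache is the service above. Keep in mind that the embedding index lives in process memory, so it is per-instance and rebuilt on restart, and that too low a similarity threshold will return cached answers for queries that only look similar.

csharp
public async Task<string> GenerateWithSemanticCacheAsync(string query)
{
    // Return a cached answer for a semantically similar query, if one exists
    var (found, cachedResult) = await _semanticCache.TryGetFromCacheAsync(query);
    if (found)
    {
        return cachedResult;
    }

    // Otherwise generate a fresh answer and cache it for future similar queries
    var result = await _openAIService.GenerateTextAsync(query);
    await _semanticCache.CacheResultAsync(query, result, TimeSpan.FromHours(6));
    return result;
}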
Optimizing Model Loading and Inference
For applications that use local models or Microsoft Semantic Kernel, optimizing model loading and inference is crucial.