Fine-Tuning LLMs for Domain-Specific Applications in .NET

Summary: This post explores how to fine-tune large language models for domain-specific applications using .NET. Learn how to prepare training data, fine-tune models, and deploy them in .NET applications to achieve better performance on specialized tasks.

Introduction

Large Language Models (LLMs) have revolutionized natural language processing with their ability to understand and generate human-like text. While general-purpose LLMs like GPT-4 and Claude are incredibly versatile, they can be further enhanced for specific domains or tasks through fine-tuning.

Fine-tuning allows you to adapt a pre-trained LLM to your specific domain, improving its performance on specialized tasks while requiring significantly less data and computational resources than training a model from scratch. This approach is particularly valuable for industries with specialized terminology, unique requirements, or specific compliance needs.

In this post, we’ll explore how to fine-tune LLMs for domain-specific applications using .NET. We’ll cover everything from preparing training data to implementing the fine-tuning process and deploying the resulting models in .NET applications. By the end of this article, you’ll have the knowledge to create more accurate and efficient AI solutions tailored to your specific domain.

Understanding LLM Fine-Tuning

Before diving into implementation, let’s understand the key concepts behind LLM fine-tuning.

What is Fine-Tuning?

Fine-tuning is the process of further training a pre-trained language model on a smaller, domain-specific dataset to adapt it to particular tasks or knowledge domains. This process involves:

  1. Starting with a Pre-trained Model: Using a model that has already been trained on a large corpus of text
  2. Preparing Domain-Specific Data: Creating a dataset that represents your specific use case
  3. Additional Training: Continuing the training process with your data, but with a lower learning rate
  4. Evaluation: Measuring the performance of the fine-tuned model on your specific tasks

Benefits of Fine-Tuning

Fine-tuning offers several advantages over using general-purpose models or training from scratch:

  • Improved Performance: Better accuracy and relevance for domain-specific tasks
  • Reduced Prompt Engineering: Less need for complex prompts to guide the model
  • Lower Inference Costs: Potentially smaller models with comparable performance
  • Reduced Token Usage: More efficient responses with less verbosity
  • Consistency: More consistent outputs aligned with your requirements
  • Proprietary Knowledge: Incorporation of proprietary information not in the original training data

When to Fine-Tune vs. Other Approaches

Fine-tuning is not always the best approach. Here’s when to consider it versus alternatives:

Approach | Best When
--- | ---
Prompt Engineering | You have limited examples, need quick implementation, or have simple requirements
Retrieval-Augmented Generation (RAG) | You need to incorporate large amounts of factual information or frequently updated content
Fine-Tuning | You have many examples of desired inputs/outputs, need consistent formatting, or require specialized behavior
Training from Scratch | You have massive amounts of data and computational resources, and existing models are fundamentally unsuitable

Setting Up the Development Environment

Let’s start by setting up our development environment for LLM fine-tuning.

Prerequisites

To follow along with this tutorial, you’ll need:

  • Visual Studio 2022 or Visual Studio Code
  • .NET 8 SDK
  • An Azure subscription
  • Access to Azure OpenAI Service
  • Basic understanding of machine learning concepts

Creating a New Project

Let’s create a new .NET project for our fine-tuning work:

bash

dotnet new console -n LlmFineTuning
cd LlmFineTuning

Installing Required Packages

Add the necessary packages to your project:

bash

dotnet add package Azure.AI.OpenAI
dotnet add package Microsoft.ML
dotnet add package Microsoft.Extensions.Configuration
dotnet add package Microsoft.Extensions.Configuration.Json
dotnet add package Microsoft.Extensions.Configuration.EnvironmentVariables
dotnet add package CsvHelper

Setting Up Configuration

Create an appsettings.json file to store your configuration:

json

{
  "AzureOpenAI": {
    "Endpoint": "https://your-openai-service.openai.azure.com/",
    "Key": "your-openai-api-key",
    "DeploymentName": "your-base-model-deployment"
  },
  "FineTuning": {
    "TrainingDataPath": "data/training_data.jsonl",
    "ValidationDataPath": "data/validation_data.jsonl",
    "ModelName": "gpt-35-turbo",
    "FineTunedModelName": "my-domain-model"
  }
}

Creating a Configuration Helper

Let’s create a helper class to load our configuration:

csharp

// Configuration/ConfigurationHelper.cs
using Microsoft.Extensions.Configuration;

namespace LlmFineTuning.Configuration
{
    public static class ConfigurationHelper
    {
        public static IConfiguration GetConfiguration()
        {
            return new ConfigurationBuilder()
                .AddJsonFile("appsettings.json", optional: false, reloadOnChange: true)
                .AddEnvironmentVariables()
                .Build();
        }
    }
}
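
With the helper in place, any class in the project can read settings using the `Section:Key` indexer syntax. A quick usage sketch:

```csharp
// Usage sketch for ConfigurationHelper
using LlmFineTuning.Configuration;

var config = ConfigurationHelper.GetConfiguration();

// Nested values are addressed with the "Section:Key" syntax
var endpoint = config["AzureOpenAI:Endpoint"];
var baseModel = config["FineTuning:ModelName"];

Console.WriteLine($"Fine-tuning {baseModel} against {endpoint}");
```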

Preparing Training Data for Fine-Tuning

The quality and format of your training data are crucial for successful fine-tuning. Let’s explore how to prepare effective training data.

Understanding Training Data Requirements

Different LLM providers have specific requirements for fine-tuning data. For Azure OpenAI Service, the data should be in JSONL format (one JSON object per line) with specific fields:

json

{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello, who are you?"}, {"role": "assistant", "content": "I am an AI assistant created by OpenAI. How can I help you today?"}]}
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "The capital of France is Paris."}]}

Each JSON object represents a complete conversation, with messages from different roles (system, user, assistant).
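
Providers typically reject files containing malformed lines, so it's worth validating the JSONL locally before uploading. A minimal sketch (the `JsonlValidator` class below is illustrative, not part of any SDK):

```csharp
// DataPreparation/JsonlValidator.cs -- illustrative helper, not part of any SDK
using System.Text.Json;

namespace LlmFineTuning.DataPreparation
{
    public static class JsonlValidator
    {
        private static readonly string[] ValidRoles = { "system", "user", "assistant" };

        public static int Validate(string path)
        {
            int lineNumber = 0;
            foreach (var line in File.ReadLines(path))
            {
                lineNumber++;

                // Throws JsonException on malformed JSON
                using var doc = JsonDocument.Parse(line);

                var messages = doc.RootElement.GetProperty("messages");
                foreach (var message in messages.EnumerateArray())
                {
                    var role = message.GetProperty("role").GetString();
                    if (Array.IndexOf(ValidRoles, role) < 0)
                        throw new FormatException($"Line {lineNumber}: unexpected role '{role}'");
                }
            }
            return lineNumber; // one example per line
        }
    }
}
```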

Creating a Data Preparation Tool

Let’s create a tool to help prepare our training data:

csharp

// DataPreparation/TrainingDataPreparer.cs
using System.Text.Json;
using System.Text.Json.Serialization;

namespace LlmFineTuning.DataPreparation
{
    public class TrainingDataPreparer
    {
        private readonly string _outputPath;
        private readonly JsonSerializerOptions _jsonOptions;

        public TrainingDataPreparer(string outputPath)
        {
            _outputPath = outputPath;
            _jsonOptions = new JsonSerializerOptions
            {
                PropertyNamingPolicy = JsonNamingPolicy.CamelCase,
                DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull
            };
        }

        public async Task PrepareFromCsvAsync(string csvPath, string systemPrompt)
        {
            // Ensure directory exists
            Directory.CreateDirectory(Path.GetDirectoryName(_outputPath));

            // Read CSV file
            using var reader = new StreamReader(csvPath);
            using var csv = new CsvHelper.CsvReader(reader, System.Globalization.CultureInfo.InvariantCulture);
            
            var records = csv.GetRecords<TrainingRecord>().ToList();
            
            // Create training examples
            using var writer = new StreamWriter(_outputPath);
            
            foreach (var record in records)
            {
                var example = new TrainingExample
                {
                    Messages = new List<Message>
                    {
                        new Message { Role = "system", Content = systemPrompt },
                        new Message { Role = "user", Content = record.UserQuery },
                        new Message { Role = "assistant", Content = record.AssistantResponse }
                    }
                };
                
                await writer.WriteLineAsync(JsonSerializer.Serialize(example, _jsonOptions));
            }
            
            Console.WriteLine($"Created {records.Count} training examples at {_outputPath}");
        }

        public async Task PrepareFromConversationsAsync(string conversationsPath)
        {
            // Ensure directory exists
            Directory.CreateDirectory(Path.GetDirectoryName(_outputPath));

            // Read conversations file (fail fast if the file is empty or contains "null")
            var conversations = JsonSerializer.Deserialize<List<Conversation>>(
                await File.ReadAllTextAsync(conversationsPath),
                _jsonOptions) ?? throw new InvalidOperationException($"No conversations found in {conversationsPath}");
            
            // Create training examples
            using var writer = new StreamWriter(_outputPath);
            
            foreach (var conversation in conversations)
            {
                var example = new TrainingExample
                {
                    Messages = conversation.Messages
                };
                
                await writer.WriteLineAsync(JsonSerializer.Serialize(example, _jsonOptions));
            }
            
            Console.WriteLine($"Created {conversations.Count} training examples at {_outputPath}");
        }

        public async Task SplitDataForValidationAsync(string inputPath, string trainingPath, string validationPath, double validationPercentage = 0.1)
        {
            // Read all lines from input file
            var lines = await File.ReadAllLinesAsync(inputPath);
            
            // Shuffle the lines
            var random = new Random(42); // Fixed seed for reproducibility
            var shuffledLines = lines.OrderBy(x => random.Next()).ToList();
            
            // Calculate split
            int validationCount = (int)(shuffledLines.Count * validationPercentage);
            int trainingCount = shuffledLines.Count - validationCount;
            
            // Write training data
            await File.WriteAllLinesAsync(trainingPath, shuffledLines.Take(trainingCount));
            
            // Write validation data
            await File.WriteAllLinesAsync(validationPath, shuffledLines.Skip(trainingCount));
            
            Console.WriteLine($"Split {shuffledLines.Count} examples into {trainingCount} training and {validationCount} validation examples");
        }
    }

    public class TrainingRecord
    {
        public string UserQuery { get; set; }
        public string AssistantResponse { get; set; }
    }

    public class Conversation
    {
        public List<Message> Messages { get; set; }
    }

    public class TrainingExample
    {
        public List<Message> Messages { get; set; }
    }

    public class Message
    {
        public string Role { get; set; }
        public string Content { get; set; }
    }
}

Creating Domain-Specific Training Data

Let’s create a sample program to generate training data for a specific domain. In this example, we’ll create data for a healthcare assistant:

csharp

// Program.cs
using LlmFineTuning.Configuration;
using LlmFineTuning.DataPreparation;
using Microsoft.Extensions.Configuration;

// Load configuration
var config = ConfigurationHelper.GetConfiguration();
var trainingDataPath = config["FineTuning:TrainingDataPath"];
var validationDataPath = config["FineTuning:ValidationDataPath"];

// Create temporary path for all data
var allDataPath = "data/all_data.jsonl";

// Create system prompt for healthcare assistant
var systemPrompt = @"You are a healthcare assistant specialized in providing information about medications, 
treatments, and general health advice. You should provide accurate, helpful information while being mindful 
of medical ethics and privacy concerns. Always clarify that you're not a replacement for professional 
medical advice and encourage users to consult healthcare providers for specific medical concerns.";

// Sample healthcare Q&A pairs
var healthcareQA = new List<(string Query, string Response)>
{
    (
        "What are the common side effects of ibuprofen?",
        @"Common side effects of ibuprofen may include:
        
1. Upset stomach or indigestion
2. Heartburn
3. Nausea or vomiting
4. Headache
5. Dizziness
6. Mild allergic reactions like rash
7. Increased blood pressure
8. Fluid retention (edema)

More serious but less common side effects can include:
- Stomach bleeding or ulcers
- Kidney problems
- Liver issues
- Severe allergic reactions

It's important to take ibuprofen as directed and consult with a healthcare provider if you experience persistent or severe side effects. Remember that this information is not a substitute for professional medical advice."
    ),
    (
        "How can I manage my diabetes through diet?",
        @"Managing diabetes through diet involves several key strategies:

1. Carbohydrate management:
   - Monitor carbohydrate intake consistently
   - Focus on complex carbohydrates with lower glycemic index
   - Spread carbohydrate consumption throughout the day

2. Recommended foods:
   - Non-starchy vegetables (spinach, broccoli, peppers)
   - Whole grains (brown rice, whole wheat, quinoa)
   - Lean proteins (chicken, fish, tofu, legumes)
   - Healthy fats (avocados, nuts, olive oil)
   - Fruits in moderation (berries are particularly good choices)

3. Foods to limit:
   - Refined carbohydrates and sugary foods
   - Processed foods high in sodium and trans fats
   - Sugary beverages
   - Alcohol (consult your doctor about safe limits)

4. Additional tips:
   - Maintain consistent meal timing
   - Practice portion control
   - Stay hydrated with water
   - Consider working with a registered dietitian

Remember that individual dietary needs vary based on factors like diabetes type, medication, activity level, and overall health. It's essential to work with your healthcare team to develop a personalized meal plan. This information is not a substitute for professional medical advice."
    ),
    // Add more Q&A pairs here
};

// Create training data preparer
var dataPreparer = new TrainingDataPreparer(allDataPath);

// Create a CSV file with the Q&A pairs
var csvPath = "data/healthcare_qa.csv";
Directory.CreateDirectory(Path.GetDirectoryName(csvPath));

using (var writer = new StreamWriter(csvPath))
{
    writer.WriteLine("UserQuery,AssistantResponse");
    foreach (var (query, response) in healthcareQA)
    {
        // Escape quotes and newlines for CSV
        var escapedQuery = query.Replace("\"", "\"\"");
        var escapedResponse = response.Replace("\"", "\"\"").Replace("\r\n", " ").Replace("\n", " ");
        writer.WriteLine($"\"{escapedQuery}\",\"{escapedResponse}\"");
    }
}

// Prepare training data from CSV
await dataPreparer.PrepareFromCsvAsync(csvPath, systemPrompt);

// Split data into training and validation sets
await dataPreparer.SplitDataForValidationAsync(allDataPath, trainingDataPath, validationDataPath, 0.2);

Console.WriteLine("Training data preparation complete!");

Best Practices for Training Data

To create effective training data for fine-tuning, follow these best practices:

  1. Quality Over Quantity: A smaller set of high-quality examples is better than a large set of low-quality ones
  2. Diversity: Include a wide range of queries and scenarios your model will encounter
  3. Representativeness: Ensure your data represents real-world usage patterns
  4. Consistency: Maintain consistent formatting and style in responses
  5. Balance: Include a balanced distribution of different types of queries
  6. Validation Split: Always set aside a portion of your data for validation
  7. Ethical Considerations: Ensure your data is free of sensitive personal information and content that could produce harmful or biased outputs
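
Several of these checks can be automated before the data ever leaves your machine. For example, exact-duplicate examples (which can cause the model to overweight certain responses) are easy to detect; a small sketch, assuming the JSONL file produced earlier:

```csharp
// Count exact-duplicate training examples in a JSONL file (illustrative sketch)
var lines = File.ReadAllLines("data/training_data.jsonl");

var duplicateGroups = lines
    .GroupBy(line => line)
    .Where(group => group.Count() > 1)
    .ToList();

Console.WriteLine($"{lines.Length} examples, {duplicateGroups.Count} duplicated");
```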