Summary: This post explores how to build multi-modal AI applications that can process both text and images using .NET and Azure AI Vision. Learn how to create applications that can understand and generate content across different modalities, enabling more natural and comprehensive AI experiences.
Introduction
Multi-modal AI represents a significant advancement in artificial intelligence, enabling systems to process and understand information across different modalities such as text, images, audio, and video. Unlike traditional AI systems that focus on a single modality, multi-modal AI can integrate and reason across these different forms of data, creating more comprehensive and natural experiences.
In this post, we’ll explore how to build multi-modal AI applications using .NET and Azure AI Vision. We’ll focus specifically on applications that can process both text and images, enabling scenarios like visual question answering, image captioning, and content generation based on both textual and visual inputs. By the end of this article, you’ll have the knowledge to create applications that can understand and generate content across different modalities.
Understanding Multi-Modal AI
Before diving into implementation, let’s understand the key concepts behind multi-modal AI.
What is Multi-Modal AI?
Multi-modal AI refers to artificial intelligence systems that can process and understand information from multiple types of inputs or “modalities.” Common modalities include:
- Text: Written language in various forms
- Images: Visual content including photos, diagrams, and artwork
- Audio: Sound, including speech and environmental sounds
- Video: Moving visual content with or without audio
- Sensor data: Information from various sensors like temperature, motion, etc.
Multi-modal AI systems can:
- Process each modality independently
- Integrate information across modalities
- Reason about relationships between different modalities
- Generate outputs that combine multiple modalities
Benefits of Multi-Modal AI
Multi-modal AI offers several advantages over single-modal approaches:
- More Complete Understanding: By processing multiple modalities, AI can gain a more comprehensive understanding of content.
- Reduced Ambiguity: Information from one modality can help clarify ambiguities in another.
- More Natural Interaction: Humans naturally communicate using multiple modalities; multi-modal AI enables more natural interaction.
- Broader Application Range: Multi-modal AI can address more complex use cases that involve different types of data.
- Improved Accessibility: Multi-modal systems can provide alternative ways to access information for users with different needs.
Common Multi-Modal AI Scenarios
Some common scenarios for multi-modal AI include:
- Visual Question Answering (VQA): Answering questions about images
- Image Captioning: Generating textual descriptions of images
- Text-to-Image Generation: Creating images based on textual descriptions
- Multi-Modal Search: Finding content based on queries that span multiple modalities
- Document Understanding: Extracting information from documents that contain both text and images
- Augmented Reality: Overlaying digital information on the physical world
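To make the image captioning scenario concrete before we start building, here is a minimal sketch using the Azure.AI.Vision.ImageAnalysis package that we install later in this post. The endpoint, key, and image URL are placeholders; in the application below, the same client is resolved from dependency injection and configured from appsettings.json.

```csharp
using Azure;
using Azure.AI.Vision.ImageAnalysis;

// Placeholder endpoint and key; in the application below these come from configuration.
var client = new ImageAnalysisClient(
    new Uri("https://your-vision-service.cognitiveservices.azure.com/"),
    new AzureKeyCredential("your-vision-api-key"));

// Request a caption and tags for an image in a single call.
ImageAnalysisResult result = await client.AnalyzeAsync(
    new Uri("https://example.com/sample-photo.jpg"),
    VisualFeatures.Caption | VisualFeatures.Tags);

Console.WriteLine($"Caption: {result.Caption.Text} (confidence {result.Caption.Confidence:P0})");

foreach (DetectedTag tag in result.Tags.Values)
{
    Console.WriteLine($"Tag: {tag.Name} (confidence {tag.Confidence:P0})");
}
```

Visual question answering builds on the same idea: the image signal from Azure AI Vision is combined with a language model (GPT-4 with Vision in this post) that can reason over both the image and a free-form question.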
Setting Up the Development Environment
Let’s start by setting up our development environment for building multi-modal AI applications.
Prerequisites
To follow along with this tutorial, you’ll need:
- Visual Studio 2022 or Visual Studio Code
- .NET 8 SDK
- An Azure subscription
- Access to Azure AI Vision (formerly Computer Vision)
- Access to Azure OpenAI Service
Creating a New Project
Let’s create a new .NET project for our multi-modal AI application:
```bash
dotnet new webapi -n MultiModalAI
cd MultiModalAI
```
Installing Required Packages
Add the necessary packages to your project:
```bash
dotnet add package Azure.AI.Vision.ImageAnalysis
dotnet add package Azure.AI.OpenAI
dotnet add package Microsoft.Extensions.Azure
dotnet add package Microsoft.Extensions.Configuration.Json
dotnet add package SixLabors.ImageSharp
```
Configuring Azure Services
Let’s set up the configuration for our Azure services:
```json
// appsettings.json
{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.AspNetCore": "Warning"
    }
  },
  "AllowedHosts": "*",
  "AzureAIVision": {
    "Endpoint": "https://your-vision-service.cognitiveservices.azure.com/",
    "Key": "your-vision-api-key"
  },
  "AzureOpenAI": {
    "Endpoint": "https://your-openai-service.openai.azure.com/",
    "Key": "your-openai-api-key",
    "DeploymentName": "your-gpt-4-vision-deployment"
  }
}
```
Registering Services
Register the Azure services in your Program.cs file:
```csharp
// Program.cs
using Azure;
using Azure.AI.OpenAI;
using Azure.AI.Vision.ImageAnalysis;
using Microsoft.Extensions.Azure;

var builder = WebApplication.CreateBuilder(args);

// Add services to the container.
builder.Services.AddControllers();
builder.Services.AddEndpointsApiExplorer();
builder.Services.AddSwaggerGen();

// Add Azure clients (registered as singletons; the SDK clients are thread-safe and meant to be reused).
builder.Services.AddSingleton(sp =>
{
    var configuration = sp.GetRequiredService<IConfiguration>();
    var endpoint = new Uri(configuration["AzureOpenAI:Endpoint"]!);
    var key = configuration["AzureOpenAI:Key"]!;
    return new OpenAIClient(endpoint, new AzureKeyCredential(key));
});

builder.Services.AddSingleton(sp =>
{
    var configuration = sp.GetRequiredService<IConfiguration>();
    var endpoint = configuration["AzureAIVision:Endpoint"]!;
    var key = configuration["AzureAIVision:Key"]!;
    return new ImageAnalysisClient(
        new Uri(endpoint),
        new AzureKeyCredential(key));
});

// Add application services
builder.Services.AddScoped<MultiModalService>();

var app = builder.Build();

// Configure the HTTP request pipeline.
if (app.Environment.IsDevelopment())
{
    app.UseSwagger();
    app.UseSwaggerUI();
}

app.UseHttpsRedirection();
app.UseAuthorization();
app.MapControllers();

app.Run();
```
Implementing Multi-Modal Services
Now, let’s implement the core services for our multi-modal AI application.
Creating Data Models
First, let’s define the data models for our application:
```csharp
// Models/MultiModalRequest.cs
namespace MultiModalAI.Models
{
    public class MultiModalRequest
    {
        public string? Text { get; set; }
        public byte[]? ImageData { get; set; }
        public string? ImageUrl { get; set; }
        public string? Task { get; set; } = "analyze"; // analyze, caption, vqa, generate
        public string? Question { get; set; } // For VQA tasks
    }
}

// Models/MultiModalResponse.cs
namespace MultiModalAI.Models
{
    public class MultiModalResponse
    {
        public string? Text { get; set; }
        public List<Tag>? Tags { get; set; }
        public List<DetectedObject>? Objects { get; set; }
        public string? Caption { get; set; }
        public string? Answer { get; set; }
        public byte[]? GeneratedImageData { get; set; }
        public string? RequestId { get; set; }
        public DateTime Timestamp { get; set; }
    }

    public class Tag
    {
        public string Name { get; set; } = string.Empty;
        public double Confidence { get; set; }
    }

    public class DetectedObject
    {
        public string Name { get; set; } = string.Empty;
        public double Confidence { get; set; }
        public BoundingBox BoundingBox { get; set; } = new();
    }

    public class BoundingBox
    {
        public int X { get; set; }
        public int Y { get; set; }
        public int Width { get; set; }
        public int Height { get; set; }
    }
}
```
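To see how these models and the clients registered in Program.cs fit together, here is a minimal sketch of the visual question answering ("vqa") path in MultiModalService. Treat it as a sketch rather than the finished service: the MultiModalAI.Services namespace, the method name, and the assumption that the image arrives as a URL are illustration choices; routing for the other task values and error handling are omitted; and the ChatMessageTextContentItem/ChatMessageImageContentItem types come from the 1.x preview releases of Azure.AI.OpenAI, so adjust to the package version you installed. If you use a Services namespace as shown, add a matching using directive in Program.cs.

```csharp
// Services/MultiModalService.cs (sketch)
using Azure;
using Azure.AI.OpenAI;
using Azure.AI.Vision.ImageAnalysis;
using MultiModalAI.Models;

namespace MultiModalAI.Services
{
    public class MultiModalService
    {
        private readonly ImageAnalysisClient _visionClient;
        private readonly OpenAIClient _openAIClient;
        private readonly IConfiguration _configuration;

        public MultiModalService(
            ImageAnalysisClient visionClient,
            OpenAIClient openAIClient,
            IConfiguration configuration)
        {
            _visionClient = visionClient;
            _openAIClient = openAIClient;
            _configuration = configuration;
        }

        // Handles the "vqa" task for an image referenced by URL.
        public async Task<MultiModalResponse> AnswerQuestionAsync(MultiModalRequest request)
        {
            // Azure AI Vision supplies a caption and tags for the image.
            ImageAnalysisResult analysis = await _visionClient.AnalyzeAsync(
                new Uri(request.ImageUrl!),
                VisualFeatures.Caption | VisualFeatures.Tags);

            // The GPT-4 with Vision deployment answers the question from text and image content items.
            var chatOptions = new ChatCompletionsOptions
            {
                DeploymentName = _configuration["AzureOpenAI:DeploymentName"],
                MaxTokens = 300,
                Messages =
                {
                    new ChatRequestSystemMessage("You answer questions about the supplied image."),
                    new ChatRequestUserMessage(
                        new ChatMessageTextContentItem(request.Question ?? "Describe this image."),
                        new ChatMessageImageContentItem(new Uri(request.ImageUrl!)))
                }
            };

            Response<ChatCompletions> completion =
                await _openAIClient.GetChatCompletionsAsync(chatOptions);

            // Map both results onto the shared response model.
            return new MultiModalResponse
            {
                Caption = analysis.Caption.Text,
                Tags = analysis.Tags.Values
                    .Select(t => new Tag { Name = t.Name, Confidence = t.Confidence })
                    .ToList(),
                Answer = completion.Value.Choices[0].Message.Content,
                RequestId = Guid.NewGuid().ToString(),
                Timestamp = DateTime.UtcNow
            };
        }
    }
}
```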
}