Summary: This post explores how to build multi-modal AI applications that can process both text and images using .NET and Azure AI Vision. Learn how to create applications that can understand and generate content across different modalities, enabling more natural and comprehensive AI experiences.
Introduction
Multi-modal AI represents a significant advancement in artificial intelligence, enabling systems to process and understand information across different modalities such as text, images, audio, and video. Unlike traditional AI systems that focus on a single modality, multi-modal AI can integrate and reason across these different forms of data, creating more comprehensive and natural experiences.
In this post, we’ll explore how to build multi-modal AI applications using .NET and Azure AI Vision. We’ll focus specifically on applications that can process both text and images, enabling scenarios like visual question answering, image captioning, and content generation based on both textual and visual inputs. By the end of this article, you’ll have the knowledge to create applications that can understand and generate content across different modalities.
Understanding Multi-Modal AI
Before diving into implementation, let’s understand the key concepts behind multi-modal AI.
What is Multi-Modal AI?
Multi-modal AI refers to artificial intelligence systems that can process and understand information from multiple types of inputs or “modalities.” Common modalities include:
- Text: Written language in various forms
- Images: Visual content including photos, diagrams, and artwork
- Audio: Sound, including speech and environmental sounds
- Video: Moving visual content with or without audio
- Sensor data: Information from various sensors like temperature, motion, etc.
Multi-modal AI systems can:
- Process each modality independently
- Integrate information across modalities
- Reason about relationships between different modalities
- Generate outputs that combine multiple modalities
Benefits of Multi-Modal AI
Multi-modal AI offers several advantages over single-modal approaches:
- More Complete Understanding: By processing multiple modalities, AI can gain a more comprehensive understanding of content.
- Reduced Ambiguity: Information from one modality can help clarify ambiguities in another.
- More Natural Interaction: Humans naturally communicate using multiple modalities; multi-modal AI enables more natural interaction.
- Broader Application Range: Multi-modal AI can address more complex use cases that involve different types of data.
- Improved Accessibility: Multi-modal systems can provide alternative ways to access information for users with different needs.
Common Multi-Modal AI Scenarios
Some common scenarios for multi-modal AI include:
- Visual Question Answering (VQA): Answering questions about images
- Image Captioning: Generating textual descriptions of images
- Text-to-Image Generation: Creating images based on textual descriptions
- Multi-Modal Search: Finding content based on queries that span multiple modalities
- Document Understanding: Extracting information from documents that contain both text and images
- Augmented Reality: Overlaying digital information on the physical world
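To make the image captioning scenario concrete before we start building, here is a minimal sketch using the Azure.AI.Vision.ImageAnalysis package that we install later in this post. The endpoint, key, and image URL are placeholders; in the application below, the same client is resolved from dependency injection and configured from appsettings.json.

```csharp
using Azure;
using Azure.AI.Vision.ImageAnalysis;

// Placeholder endpoint and key; in the application below these come from configuration.
var client = new ImageAnalysisClient(
    new Uri("https://your-vision-service.cognitiveservices.azure.com/"),
    new AzureKeyCredential("your-vision-api-key"));

// Request a caption and tags for an image in a single call.
ImageAnalysisResult result = await client.AnalyzeAsync(
    new Uri("https://example.com/sample-photo.jpg"),
    VisualFeatures.Caption | VisualFeatures.Tags);

Console.WriteLine($"Caption: {result.Caption.Text} (confidence {result.Caption.Confidence:P0})");

foreach (DetectedTag tag in result.Tags.Values)
{
    Console.WriteLine($"Tag: {tag.Name} (confidence {tag.Confidence:P0})");
}
```

Visual question answering builds on the same idea: the image signal from Azure AI Vision is combined with a language model (GPT-4 with Vision in this post) that can reason over both the image and a free-form question.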
Setting Up the Development Environment
Let’s start by setting up our development environment for building multi-modal AI applications.
Prerequisites
To follow along with this tutorial, you’ll need:
- Visual Studio 2022 or Visual Studio Code
- .NET 8 SDK
- An Azure subscription
- Access to Azure AI Vision (formerly Computer Vision)
- Access to Azure OpenAI Service
Creating a New Project
Let’s create a new .NET project for our multi-modal AI application:
```bash
dotnet new webapi -n MultiModalAI
cd MultiModalAI
```
Installing Required Packages
Add the necessary packages to your project:
```bash
dotnet add package Azure.AI.Vision.ImageAnalysis
dotnet add package Azure.AI.OpenAI
dotnet add package Microsoft.Extensions.Azure
dotnet add package Microsoft.Extensions.Configuration.Json
dotnet add package SixLabors.ImageSharp
```
Configuring Azure Services
Let’s set up the configuration for our Azure services:
```json
// appsettings.json
{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.AspNetCore": "Warning"
    }
  },
  "AllowedHosts": "*",
  "AzureAIVision": {
    "Endpoint": "https://your-vision-service.cognitiveservices.azure.com/",
    "Key": "your-vision-api-key"
  },
  "AzureOpenAI": {
    "Endpoint": "https://your-openai-service.openai.azure.com/",
    "Key": "your-openai-api-key",
    "DeploymentName": "your-gpt-4-vision-deployment"
  }
}
```
Registering Services
Register the Azure services in your Program.cs file:
```csharp
// Program.cs
using Azure;
using Azure.AI.OpenAI;
using Azure.AI.Vision.ImageAnalysis;
using Microsoft.Extensions.Azure;

var builder = WebApplication.CreateBuilder(args);

// Add services to the container.
builder.Services.AddControllers();
builder.Services.AddEndpointsApiExplorer();
builder.Services.AddSwaggerGen();

// Add Azure clients (registered as singletons; the SDK clients are thread-safe and meant to be reused).
builder.Services.AddSingleton(sp =>
{
    var configuration = sp.GetRequiredService<IConfiguration>();
    var endpoint = new Uri(configuration["AzureOpenAI:Endpoint"]!);
    var key = configuration["AzureOpenAI:Key"]!;
    return new OpenAIClient(endpoint, new AzureKeyCredential(key));
});

builder.Services.AddSingleton(sp =>
{
    var configuration = sp.GetRequiredService<IConfiguration>();
    var endpoint = configuration["AzureAIVision:Endpoint"]!;
    var key = configuration["AzureAIVision:Key"]!;
    return new ImageAnalysisClient(
        new Uri(endpoint),
        new AzureKeyCredential(key));
});

// Add application services
builder.Services.AddScoped<MultiModalService>();

var app = builder.Build();

// Configure the HTTP request pipeline.
if (app.Environment.IsDevelopment())
{
    app.UseSwagger();
    app.UseSwaggerUI();
}

app.UseHttpsRedirection();
app.UseAuthorization();
app.MapControllers();

app.Run();
```
Implementing Multi-Modal Services
Now, let’s implement the core services for our multi-modal AI application.
Creating Data Models
First, let’s define the data models for our application:
```csharp
// Models/MultiModalRequest.cs
namespace MultiModalAI.Models
{
    public class MultiModalRequest
    {
        public string? Text { get; set; }
        public byte[]? ImageData { get; set; }
        public string? ImageUrl { get; set; }
        public string? Task { get; set; } = "analyze"; // analyze, caption, vqa, generate
        public string? Question { get; set; } // For VQA tasks
    }
}

// Models/MultiModalResponse.cs
namespace MultiModalAI.Models
{
    public class MultiModalResponse
    {
        public string? Text { get; set; }
        public List<Tag>? Tags { get; set; }
        public List<DetectedObject>? Objects { get; set; }
        public string? Caption { get; set; }
        public string? Answer { get; set; }
        public byte[]? GeneratedImageData { get; set; }
        public string? RequestId { get; set; }
        public DateTime Timestamp { get; set; }
    }

    public class Tag
    {
        public string Name { get; set; } = string.Empty;
        public double Confidence { get; set; }
    }

    public class DetectedObject
    {
        public string Name { get; set; } = string.Empty;
        public double Confidence { get; set; }
        public BoundingBox BoundingBox { get; set; } = new();
    }

    public class BoundingBox
    {
        public int X { get; set; }
        public int Y { get; set; }
        public int Width { get; set; }
        public int Height { get; set; }
    }
}
```
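To see how these models and the clients registered in Program.cs fit together, here is a minimal sketch of the visual question answering ("vqa") path in MultiModalService. Treat it as a sketch rather than the finished service: the MultiModalAI.Services namespace, the method name, and the assumption that the image arrives as a URL are illustration choices; routing for the other task values and error handling are omitted; and the ChatMessageTextContentItem/ChatMessageImageContentItem types come from the 1.x preview releases of Azure.AI.OpenAI, so adjust to the package version you installed. If you use a Services namespace as shown, add a matching using directive in Program.cs.

```csharp
// Services/MultiModalService.cs (sketch)
using Azure;
using Azure.AI.OpenAI;
using Azure.AI.Vision.ImageAnalysis;
using MultiModalAI.Models;

namespace MultiModalAI.Services
{
    public class MultiModalService
    {
        private readonly ImageAnalysisClient _visionClient;
        private readonly OpenAIClient _openAIClient;
        private readonly IConfiguration _configuration;

        public MultiModalService(
            ImageAnalysisClient visionClient,
            OpenAIClient openAIClient,
            IConfiguration configuration)
        {
            _visionClient = visionClient;
            _openAIClient = openAIClient;
            _configuration = configuration;
        }

        // Handles the "vqa" task for an image referenced by URL.
        public async Task<MultiModalResponse> AnswerQuestionAsync(MultiModalRequest request)
        {
            // Azure AI Vision supplies a caption and tags for the image.
            ImageAnalysisResult analysis = await _visionClient.AnalyzeAsync(
                new Uri(request.ImageUrl!),
                VisualFeatures.Caption | VisualFeatures.Tags);

            // The GPT-4 with Vision deployment answers the question from text and image content items.
            var chatOptions = new ChatCompletionsOptions
            {
                DeploymentName = _configuration["AzureOpenAI:DeploymentName"],
                MaxTokens = 300,
                Messages =
                {
                    new ChatRequestSystemMessage("You answer questions about the supplied image."),
                    new ChatRequestUserMessage(
                        new ChatMessageTextContentItem(request.Question ?? "Describe this image."),
                        new ChatMessageImageContentItem(new Uri(request.ImageUrl!)))
                }
            };

            Response<ChatCompletions> completion =
                await _openAIClient.GetChatCompletionsAsync(chatOptions);

            // Map both results onto the shared response model.
            return new MultiModalResponse
            {
                Caption = analysis.Caption.Text,
                Tags = analysis.Tags.Values
                    .Select(t => new Tag { Name = t.Name, Confidence = t.Confidence })
                    .ToList(),
                Answer = completion.Value.Choices[0].Message.Content,
                RequestId = Guid.NewGuid().ToString(),
                Timestamp = DateTime.UtcNow
            };
        }
    }
}
```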
}