Implementing Multi-Modal AI Applications with .NET and Azure AI Vision

Summary: This post explores how to build multi-modal AI applications that can process both text and images using .NET and Azure AI Vision. Learn how to create applications that can understand and generate content across different modalities, enabling more natural and comprehensive AI experiences.

Introduction

Multi-modal AI represents a significant advancement in artificial intelligence, enabling systems to process and understand information across different modalities such as text, images, audio, and video. Unlike traditional AI systems that focus on a single modality, multi-modal AI can integrate and reason across these different forms of data, creating more comprehensive and natural experiences.

In this post, we’ll explore how to build multi-modal AI applications using .NET and Azure AI Vision. We’ll focus specifically on applications that can process both text and images, enabling scenarios like visual question answering, image captioning, and content generation based on both textual and visual inputs. By the end of this article, you’ll have the knowledge to create applications that can understand and generate content across different modalities.

Understanding Multi-Modal AI

Before diving into implementation, let’s understand the key concepts behind multi-modal AI.

What is Multi-Modal AI?

Multi-modal AI refers to artificial intelligence systems that can process and understand information from multiple types of inputs or “modalities.” Common modalities include:

  • Text: Written language in various forms
  • Images: Visual content including photos, diagrams, and artwork
  • Audio: Sound, including speech and environmental sounds
  • Video: Moving visual content with or without audio
  • Sensor data: Information from various sensors like temperature, motion, etc.

Multi-modal AI systems can:

  1. Process each modality independently
  2. Integrate information across modalities
  3. Reason about relationships between different modalities
  4. Generate outputs that combine multiple modalities

Benefits of Multi-Modal AI

Multi-modal AI offers several advantages over single-modal approaches:

  1. More Complete Understanding: By processing multiple modalities, AI can gain a more comprehensive understanding of content.
  2. Reduced Ambiguity: Information from one modality can help clarify ambiguities in another.
  3. More Natural Interaction: Humans naturally communicate using multiple modalities; multi-modal AI enables more natural interaction.
  4. Broader Application Range: Multi-modal AI can address more complex use cases that involve different types of data.
  5. Improved Accessibility: Multi-modal systems can provide alternative ways to access information for users with different needs.

Common Multi-Modal AI Scenarios

Some common scenarios for multi-modal AI include:

  • Visual Question Answering (VQA): Answering questions about images
  • Image Captioning: Generating textual descriptions of images
  • Text-to-Image Generation: Creating images based on textual descriptions
  • Multi-Modal Search: Finding content based on queries that span multiple modalities
  • Document Understanding: Extracting information from documents that contain both text and images
  • Augmented Reality: Overlaying digital information on the physical world
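
As a concrete preview of these scenarios, here is a minimal sketch of image captioning using the Azure AI Vision SDK that we set up in the next section. The endpoint, key, and image URL are placeholders you would replace with your own values:

csharp

using Azure;
using Azure.AI.Vision.ImageAnalysis;

// Minimal image-captioning sketch (placeholder endpoint, key, and image URL).
var client = new ImageAnalysisClient(
    new Uri("https://your-vision-service.cognitiveservices.azure.com/"),
    new AzureKeyCredential("your-vision-api-key"));

// Request only the Caption feature for this example.
ImageAnalysisResult result = await client.AnalyzeAsync(
    new Uri("https://example.com/sample.jpg"),
    VisualFeatures.Caption);

Console.WriteLine($"Caption: {result.Caption.Text} (confidence {result.Caption.Confidence:P0})");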

Setting Up the Development Environment

Let’s start by setting up our development environment for building multi-modal AI applications.

Prerequisites

To follow along with this tutorial, you’ll need:

  • Visual Studio 2022 or Visual Studio Code
  • .NET 8 SDK
  • An Azure subscription
  • Access to Azure AI Vision (formerly Computer Vision)
  • Access to Azure OpenAI Service

Creating a New Project

Let’s create a new .NET project for our multi-modal AI application:

bash

dotnet new webapi -n MultiModalAI
cd MultiModalAI

Installing Required Packages

Add the necessary packages to your project:

bash

dotnet add package Azure.AI.Vision.ImageAnalysis
dotnet add package Azure.AI.OpenAI
dotnet add package Microsoft.Extensions.Azure
dotnet add package Microsoft.Extensions.Configuration.Json
dotnet add package SixLabors.ImageSharp

Configuring Azure Services

Let’s set up the configuration for our Azure services:

json

// appsettings.json
{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.AspNetCore": "Warning"
    }
  },
  "AllowedHosts": "*",
  "AzureAIVision": {
    "Endpoint": "https://your-vision-service.cognitiveservices.azure.com/",
    "Key": "your-vision-api-key"
  },
  "AzureOpenAI": {
    "Endpoint": "https://your-openai-service.openai.azure.com/",
    "Key": "your-openai-api-key",
    "DeploymentName": "your-gpt-4-vision-deployment"
  }
}
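
Storing keys directly in appsettings.json keeps the walkthrough simple, but avoid committing real keys to source control. For local development you can keep them in the .NET user-secrets store instead, using the same configuration keys:

bash

dotnet user-secrets init
dotnet user-secrets set "AzureAIVision:Key" "your-vision-api-key"
dotnet user-secrets set "AzureOpenAI:Key" "your-openai-api-key"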

Registering Services

Register the Azure services in your Program.cs file:

csharp

// Program.cs
using Azure;
using Azure.AI.OpenAI;
using Azure.AI.Vision.ImageAnalysis;
using Microsoft.Extensions.Azure;

var builder = WebApplication.CreateBuilder(args);

// Add services to the container.
builder.Services.AddControllers();
builder.Services.AddEndpointsApiExplorer();
builder.Services.AddSwaggerGen();

// Add Azure clients
builder.Services.AddSingleton(sp =>
{
    var configuration = sp.GetRequiredService<IConfiguration>();
    var endpoint = new Uri(configuration["AzureOpenAI:Endpoint"]);
    var key = configuration["AzureOpenAI:Key"];
    return new OpenAIClient(endpoint, new AzureKeyCredential(key));
});

builder.Services.AddSingleton(sp =>
{
    var configuration = sp.GetRequiredService<IConfiguration>();
    var endpoint = configuration["AzureAIVision:Endpoint"];
    var key = configuration["AzureAIVision:Key"];
    return new ImageAnalysisClient(
        new Uri(endpoint),
        new AzureKeyCredential(key));
});

// Add application services
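// MultiModalService is implemented in the next section; add a using directive for its namespace once it exists.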
builder.Services.AddScoped<MultiModalService>();

var app = builder.Build();

// Configure the HTTP request pipeline.
if (app.Environment.IsDevelopment())
{
    app.UseSwagger();
    app.UseSwaggerUI();
}

app.UseHttpsRedirection();
app.UseAuthorization();
app.MapControllers();

app.Run();

Implementing Multi-Modal Services

Now, let’s implement the core services for our multi-modal AI application.

Creating Data Models

First, let’s define the data models for our application:

csharp

// Models/MultiModalRequest.cs
namespace MultiModalAI.Models
{
    public class MultiModalRequest
    {
        public string? Text { get; set; }
        public byte[]? ImageData { get; set; }
        public string? ImageUrl { get; set; }
        public string? Task { get; set; } = "analyze"; // analyze, caption, vqa, generate
        public string? Question { get; set; } // For VQA tasks
    }
}

// Models/MultiModalResponse.cs
namespace MultiModalAI.Models
{
    public class MultiModalResponse
    {
        public string? Text { get; set; }
        public List<Tag>? Tags { get; set; }
        public List<DetectedObject>? Objects { get; set; }
        public string? Caption { get; set; }
        public string? Answer { get; set; }
        public byte[]? GeneratedImageData { get; set; }
        public string? RequestId { get; set; }
        public DateTime Timestamp { get; set; }
    }

    public class Tag
    {
        public string Name { get; set; } = string.Empty;
        public double Confidence { get; set; }
    }

    public class DetectedObject
    {
        public string Name { get; set; } = string.Empty;
        public double Confidence { get; set; }
        public BoundingBox BoundingBox { get; set; } = new();
    }

    public class BoundingBox
    {
        public int X { get; set; }
        public int Y { get; set; }
        public int Width { get; set; }
        public int Height { get; set; }
    }
}
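
Before we implement the service itself, here is a brief sketch of how these response models line up with the types returned by the Azure AI Vision SDK. This is illustrative only: the helper class and method names are my own, not part of the SDK, and the code assumes Tags and Objects were requested as visual features. Note that the SDK also defines its own DetectedObject type, so the model namespace is aliased to avoid a name collision:

csharp

// Illustrative mapping from Azure AI Vision results to the response models above.
// Assumes VisualFeatures.Tags and VisualFeatures.Objects were requested during analysis.
using Azure.AI.Vision.ImageAnalysis;
using Models = MultiModalAI.Models;

public static class ImageAnalysisMapping
{
    public static List<Models.Tag> ToTags(ImageAnalysisResult result) =>
        result.Tags.Values
            .Select(t => new Models.Tag { Name = t.Name, Confidence = t.Confidence })
            .ToList();

    public static List<Models.DetectedObject> ToObjects(ImageAnalysisResult result) =>
        result.Objects.Values
            .Select(o => new Models.DetectedObject
            {
                // The SDK reports an object's label via its tags.
                Name = o.Tags.FirstOrDefault()?.Name ?? "unknown",
                Confidence = o.Tags.FirstOrDefault()?.Confidence ?? 0,
                BoundingBox = new Models.BoundingBox
                {
                    X = o.BoundingBox.X,
                    Y = o.BoundingBox.Y,
                    Width = o.BoundingBox.Width,
                    Height = o.BoundingBox.Height
                }
            })
            .ToList();
}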