Multimodal Input with Gemini

Gemini accepts text and images together in a single request, which enables tasks such as image analysis and visual question answering.

The examples below send the same request in TypeScript, then Python.

import { readFile } from "node:fs/promises";
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const model = genAI.getGenerativeModel({ model: "gemini-2.0-flash" });

// Read the image from disk and base64-encode it:
// inlineData.data must be a base64 string, not an ArrayBuffer
const imageBase64 = (await readFile("car.jpg")).toString("base64");

const result = await model.generateContent([
  "Tell me about the car in this image.",
  {
    inlineData: {
      mimeType: "image/jpeg",
      data: imageBase64
    }
  }
]);

console.log(result.response.text());

from google import genai
from PIL import Image

# Initialize the client
client = genai.Client()

# Open an image file
image = Image.open("car.jpg")

# Generate content with both text and image
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        "Tell me about the car in this image.",
        image,
    ],
)

print(response.text)

Use Cases

Image analysis and description
Visual question answering
Document analysis with images
Scene understanding and object detection
Visual reasoning tasks

Response Format

The model returns a single text response that draws on both the text prompt and the image. In the examples above it is read as a string: result.response.text() in TypeScript and response.text in Python.

Audio Transcription

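Gemini can also process audio files to produce transcriptions and analysis. The TypeScript example below uploads the file through the File API and polls until processing finishes; the Python example sends the audio bytes inline with the request.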

import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager, FileState } from "@google/generative-ai/server";

// Initialize the file manager and Gemini
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY);
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const model = genAI.getGenerativeModel({ model: "gemini-2.0-flash" });

// Upload the audio file
const uploadResult = await fileManager.uploadFile(
  "path/to/audio.mp3",
  {
    mimeType: "audio/mp3",
    displayName: "Audio sample",
  }
);

// Wait for file processing
let file = await fileManager.getFile(uploadResult.file.name);
while (file.state === FileState.PROCESSING) {
  await new Promise(resolve => setTimeout(resolve, 10000));
  file = await fileManager.getFile(uploadResult.file.name);
}

if (file.state === FileState.FAILED) {
  throw new Error("Audio processing failed.");
}

// Generate transcription
const result = await model.generateContent([
  "Please provide a detailed transcription of this audio.",
  {
    fileData: {
      fileUri: uploadResult.file.uri,
      mimeType: uploadResult.file.mimeType,
    },
  }
]);

console.log(result.response.text());

from google import genai
from google.genai import types

# Initialize the client
client = genai.Client()

# Read the audio file
with open("path/to/audio.mp3", "rb") as audio_file:
    audio_bytes = audio_file.read()

# Generate transcription
response = client.models.generate_content(
    model='gemini-2.0-flash',
    contents=[
        'Please provide a detailed transcription of this audio.',
        types.Part.from_bytes(
            data=audio_bytes,
            mime_type='audio/mp3',
        )
    ]
)

print(response.text)
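
The TypeScript example above polls the uploaded file every 10 seconds with no upper bound on the wait. A minimal Python sketch of the same wait with a timeout follows. It is not tied to any SDK: `get_state` is a hypothetical callable that you would back with whatever call returns the file's current state, and the state strings "PROCESSING" and "FAILED" are assumed here to mirror the FileState names used in the TypeScript code.

```python
import time
from typing import Callable


def wait_until_processed(
    get_state: Callable[[], str],
    poll_interval: float = 10.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state() until it stops returning "PROCESSING".

    get_state is a hypothetical stand-in for the real file-status call.
    Raises TimeoutError if the deadline passes, RuntimeError on "FAILED".
    """
    deadline = time.monotonic() + timeout
    state = get_state()
    while state == "PROCESSING":
        if time.monotonic() >= deadline:
            raise TimeoutError("Audio file was still processing at the deadline.")
        sleep(poll_interval)
        state = get_state()
    if state == "FAILED":
        raise RuntimeError("Audio processing failed.")
    return state
```

Passing `sleep` as a parameter keeps the loop testable without real delays; in normal use the default `time.sleep` applies.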

Supported Audio Formats

Gemini supports the following audio formats:

WAV (audio/wav)
MP3 (audio/mp3)
AIFF (audio/aiff)
AAC (audio/aac)
OGG Vorbis (audio/ogg)
FLAC (audio/flac)
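
One way to use this list is to pick the MIME type from the file name before sending a request. The sketch below copies the MIME types from the list above; the file extensions paired with them and the helper name `audio_mime_type` are assumptions made for illustration, not part of any SDK.

```python
from pathlib import Path

# MIME types are taken from the supported-formats list above.
# The extension chosen for each format is an assumption.
SUPPORTED_AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mp3",
    ".aiff": "audio/aiff",
    ".aac": "audio/aac",
    ".ogg": "audio/ogg",
    ".flac": "audio/flac",
}


def audio_mime_type(path: str) -> str:
    """Return the MIME type for a supported audio file, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    try:
        return SUPPORTED_AUDIO_MIME_TYPES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported audio format: {path!r}") from None
```

For example, `audio_mime_type("path/to/audio.mp3")` returns `"audio/mp3"`, which can be passed as the `mime_type` argument in the Python example above.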

Technical Notes

Each second of audio is represented as 32 tokens
Maximum supported length is 9.5 hours per prompt
Audio is downsampled to 16 Kbps
Multiple channels are combined into a single channel
English-language speech is supported
Non-speech components (like music, ambient sounds) can be recognized
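
The first two notes are enough to estimate the token cost of a clip and to check whether it fits in one prompt. The following is a rough sketch using only the figures stated above (32 tokens per second, 9.5 hours maximum); the function names are illustrative, and rounding partial seconds up is an assumption.

```python
import math

TOKENS_PER_SECOND = 32        # from the notes above
MAX_PROMPT_AUDIO_HOURS = 9.5  # from the notes above


def estimate_audio_tokens(duration_seconds: float) -> int:
    """Estimate the token count for an audio clip of the given length."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    # Rounding up is an assumption; the notes only give the per-second rate.
    return math.ceil(duration_seconds * TOKENS_PER_SECOND)


def fits_in_one_prompt(duration_seconds: float) -> bool:
    """Check the clip length against the 9.5-hour per-prompt limit."""
    return duration_seconds <= MAX_PROMPT_AUDIO_HOURS * 3600


print(estimate_audio_tokens(60))         # one minute -> 1920
print(estimate_audio_tokens(9.5 * 3600)) # maximum length -> 1094400
print(fits_in_one_prompt(10 * 3600))     # ten hours -> False
```

Audio is only part of a request: the text prompt and the model's output add tokens on top of this estimate.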
