Multimodal Input with Gemini
Gemini can process text and images together in a single request, enabling rich interactions such as image analysis, visual question answering, and document understanding.
TypeScript:
import { GoogleGenerativeAI } from "@google/generative-ai";
import * as fs from "node:fs";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const model = genAI.getGenerativeModel({ model: "gemini-2.0-flash" });

// Load the image and base64-encode it; inlineData expects a base64 string
const imageData = fs.readFileSync("car.jpg");

const result = await model.generateContent([
  "Tell me about the car in this image.",
  {
    inlineData: {
      mimeType: "image/jpeg",
      data: imageData.toString("base64"),
    },
  },
]);
console.log(result.response.text());
Python:
from google import genai
from PIL import Image

# Initialize the client (reads GEMINI_API_KEY from the environment)
client = genai.Client()

# Open an image file
image = Image.open("car.jpg")

# Generate content with both text and image
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=["Tell me about the car in this image.", image],
)
print(response.text)
Use Cases
- Image analysis and description
- Visual question answering (see the multi-image sketch after this list)
- Document analysis with images
- Scene understanding and object detection
- Visual reasoning tasks
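Visual question answering extends naturally to more than one image per request. The sketch below uses the google-genai Python SDK; the second filename and the comparative prompt are illustrative assumptions, not part of the original example:

from google import genai
from PIL import Image

client = genai.Client()

# Hypothetical example: ask one comparative question across two local images
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        "Which of these two cars looks newer, and why?",
        Image.open("car.jpg"),   # assumed local file
        Image.open("car2.jpg"),  # assumed local file
    ],
)
print(response.text)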
 
Response Format
The model returns a single text response that draws on both the text prompt and the image. Read it as a string: response.text in Python, or result.response.text() in TypeScript.
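A minimal sketch of reading the response with the google-genai Python SDK; the usage_metadata field names are an assumption about the current SDK surface:

from google import genai
from PIL import Image

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=["Tell me about the car in this image.", Image.open("car.jpg")],
)

# The answer as a plain string
print(response.text)

# Token accounting for the request (assumed field names)
print(response.usage_metadata.total_token_count)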
Audio Transcription
Gemini can also process audio files to provide transcriptions and analysis. Here's how to work with audio:
TypeScript:
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager, FileState } from "@google/generative-ai/server";
// Initialize the file manager and Gemini
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY);
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const model = genAI.getGenerativeModel({ model: "gemini-2.0-flash" });
// Upload the audio file
const uploadResult = await fileManager.uploadFile(
  "path/to/audio.mp3",
  {
    mimeType: "audio/mp3",
    displayName: "Audio sample",
  }
);
// Wait for file processing
let file = await fileManager.getFile(uploadResult.file.name);
while (file.state === FileState.PROCESSING) {
  await new Promise(resolve => setTimeout(resolve, 10000));
  file = await fileManager.getFile(uploadResult.file.name);
}
if (file.state === FileState.FAILED) {
  throw new Error("Audio processing failed.");
}
// Generate transcription
const result = await model.generateContent([
  "Please provide a detailed transcription of this audio.",
  {
    fileData: {
      fileUri: uploadResult.file.uri,
      mimeType: uploadResult.file.mimeType,
    },
  }
]);
console.log(result.response.text());
Python:
from google import genai
from google.genai import types

# Initialize the client
client = genai.Client()

# Read the audio file
with open("path/to/audio.mp3", "rb") as audio_file:
    audio_bytes = audio_file.read()

# Generate a transcription
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        "Please provide a detailed transcription of this audio.",
        types.Part.from_bytes(
            data=audio_bytes,
            mime_type="audio/mp3",
        ),
    ],
)
print(response.text)
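Inline bytes count against the overall request size limit (roughly 20 MB), so the approach above suits short clips. For longer recordings, the Files API used in the TypeScript example is also available in Python; a minimal sketch, assuming the standard google-genai client:

from google import genai

client = genai.Client()

# Upload once via the Files API, then reference the uploaded file in the prompt
audio_file = client.files.upload(file="path/to/audio.mp3")

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        "Please provide a detailed transcription of this audio.",
        audio_file,
    ],
)
print(response.text)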
Supported Audio Formats
Gemini supports the following audio formats:
- WAV (audio/wav)
- MP3 (audio/mp3)
- AIFF (audio/aiff)
- AAC (audio/aac)
- OGG Vorbis (audio/ogg)
- FLAC (audio/flac)
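Because the MIME type must be supplied explicitly when sending raw bytes, a small lookup over the formats above helps avoid typos. This helper is a hypothetical convenience, not part of any SDK:

from pathlib import Path

# Hypothetical helper: map file extensions to the MIME types listed above
AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mp3",
    ".aiff": "audio/aiff",
    ".aac": "audio/aac",
    ".ogg": "audio/ogg",
    ".flac": "audio/flac",
}

def audio_mime_type(path: str) -> str:
    """Return the Gemini MIME type for a supported audio file."""
    ext = Path(path).suffix.lower()
    if ext not in AUDIO_MIME_TYPES:
        raise ValueError(f"Unsupported audio format: {ext}")
    return AUDIO_MIME_TYPES[ext]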
Technical Notes
- Each second of audio is represented as 32 tokens (see the worked estimate after this list)
- Maximum supported length is 9.5 hours of audio per prompt
- Audio is downsampled to a 16 Kbps data resolution
- Multiple audio channels are combined into a single channel
- English-language speech is supported
- Non-speech components (such as music and ambient sounds) can be recognized
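At 32 tokens per second, a clip's token footprint is easy to estimate. A quick illustration (the helper name is ours, for demonstration only):

# Rough token estimate: 32 tokens per second of audio
def estimate_audio_tokens(duration_seconds: float) -> int:
    return int(duration_seconds * 32)

print(estimate_audio_tokens(60))          # 1 minute  -> 1920 tokens
print(estimate_audio_tokens(9.5 * 3600))  # 9.5 hours -> 1094400 tokens (the per-prompt cap)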