Multimodal Input with Gemini
Gemini can process text and images together in a single request, enabling rich interactions such as image analysis, visual question answering, and document understanding.
TypeScript:

```typescript
import { GoogleGenerativeAI } from "@google/generative-ai";
import fs from "node:fs";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const model = genAI.getGenerativeModel({ model: "gemini-2.0-flash" });

// Load the image and base64-encode it; inlineData expects a base64 string
const imageData = fs.readFileSync("car.jpg").toString("base64");

const result = await model.generateContent([
  "Tell me about the car in this image.",
  {
    inlineData: {
      mimeType: "image/jpeg",
      data: imageData,
    },
  },
]);
console.log(result.response.text());
```
Python:

```python
from google import genai
from PIL import Image

# Initialize the client
client = genai.Client()

# Open an image file
image = Image.open("car.jpg")

# Generate content with both text and image
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=["Tell me about the car in this image.", image],
)
print(response.text)
```
Use Cases
- Image analysis and description
- Visual question answering
- Document analysis with images
- Scene understanding and object detection
- Visual reasoning tasks
Response Format
The model returns a text response that combines its understanding of the text prompt and the image, delivered as a single string containing the analysis or answer to your query.
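Because the result is just a string, the whole request can be wrapped in a small convenience function. A minimal sketch using the TypeScript SDK from above; the `describeImage` helper is our own illustration, not part of the SDK:

```typescript
import { GoogleGenerativeAI } from "@google/generative-ai";
import fs from "node:fs";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const model = genAI.getGenerativeModel({ model: "gemini-2.0-flash" });

// Hypothetical helper: send one prompt plus one JPEG image and
// return the model's text answer as a plain string.
async function describeImage(path: string, prompt: string): Promise<string> {
  const result = await model.generateContent([
    prompt,
    {
      inlineData: {
        mimeType: "image/jpeg",
        data: fs.readFileSync(path).toString("base64"),
      },
    },
  ]);
  return result.response.text();
}

console.log(await describeImage("car.jpg", "What color is this car?"));
```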
Audio Transcription
Gemini can also process audio files to provide transcriptions and analysis. Here's how to work with audio:
TypeScript:

```typescript
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager, FileState } from "@google/generative-ai/server";

// Initialize the file manager and Gemini
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY);
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const model = genAI.getGenerativeModel({ model: "gemini-2.0-flash" });

// Upload the audio file
const uploadResult = await fileManager.uploadFile("path/to/audio.mp3", {
  mimeType: "audio/mp3",
  displayName: "Audio sample",
});

// Poll until the file has finished processing
let file = await fileManager.getFile(uploadResult.file.name);
while (file.state === FileState.PROCESSING) {
  await new Promise(resolve => setTimeout(resolve, 10000));
  file = await fileManager.getFile(uploadResult.file.name);
}
if (file.state === FileState.FAILED) {
  throw new Error("Audio processing failed.");
}

// Generate transcription
const result = await model.generateContent([
  "Please provide a detailed transcription of this audio.",
  {
    fileData: {
      fileUri: uploadResult.file.uri,
      mimeType: uploadResult.file.mimeType,
    },
  },
]);
console.log(result.response.text());
```
Python:

```python
from google import genai
from google.genai import types

# Initialize the client
client = genai.Client()

# Read the audio file
with open("path/to/audio.mp3", "rb") as audio_file:
    audio_bytes = audio_file.read()

# Generate transcription
response = client.models.generate_content(
    model='gemini-2.0-flash',
    contents=[
        'Please provide a detailed transcription of this audio.',
        types.Part.from_bytes(
            data=audio_bytes,
            mime_type='audio/mp3',
        ),
    ],
)
print(response.text)
```
Supported Audio Formats
Gemini supports the following audio formats:
- WAV (`audio/wav`)
- MP3 (`audio/mp3`)
- AIFF (`audio/aiff`)
- AAC (`audio/aac`)
- OGG Vorbis (`audio/ogg`)
- FLAC (`audio/flac`)
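Since both the Files API upload and inline byte parts take an explicit MIME type, it can be convenient to derive it from the file extension. A minimal sketch; the `audioMimeType` helper and its lookup table are our own, not part of the SDK:

```typescript
// Hypothetical helper: map a file extension to one of the supported
// audio MIME types listed above.
const AUDIO_MIME_TYPES: Record<string, string> = {
  wav: "audio/wav",
  mp3: "audio/mp3",
  aiff: "audio/aiff",
  aac: "audio/aac",
  ogg: "audio/ogg",
  flac: "audio/flac",
};

function audioMimeType(path: string): string {
  const ext = path.split(".").pop()?.toLowerCase() ?? "";
  const mime = AUDIO_MIME_TYPES[ext];
  if (!mime) throw new Error(`Unsupported audio format: .${ext}`);
  return mime;
}

// e.g. audioMimeType("path/to/audio.mp3") === "audio/mp3"
```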
Technical Notes
- Each second of audio is represented as 32 tokens, so one minute of audio costs 1,920 tokens (see the estimate sketch after this list)
- Maximum supported length is 9.5 hours of audio per prompt
- Audio is downsampled to a 16 Kbps data resolution
- Multiple channels are combined into a single channel
- English-language speech is supported
- Non-speech components (such as music and ambient sounds) can be recognized
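The 32 tokens-per-second figure makes audio token budgets easy to estimate; for example, a 10-minute clip is 600 s × 32 = 19,200 tokens. A quick sketch of that arithmetic:

```typescript
// Estimate the token cost of an audio clip from the
// 32 tokens-per-second figure noted above.
function estimateAudioTokens(durationSeconds: number): number {
  return Math.ceil(durationSeconds * 32);
}

console.log(estimateAudioTokens(600)); // 19200 tokens for a 10-minute clip
```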