Multimodal Input with Gemini

Gemini can process text and images together in a single request, which enables image analysis, visual question answering, and similar tasks.

import { readFile } from "node:fs/promises";
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const model = genAI.getGenerativeModel({ model: "gemini-2.0-flash" });

// Load the image from disk and encode it as base64;
// inlineData.data must be a base64 string, not raw bytes
const imageBase64 = (await readFile("car.jpg")).toString("base64");

const result = await model.generateContent([
  "Tell me about the car in this image.",
  {
    inlineData: {
      mimeType: "image/jpeg",
      data: imageBase64,
    },
  },
]);

// Print the model's text answer
console.log(result.response.text());
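
The inlineData.data field carries the file bytes as a base64-encoded string. The following is a minimal sketch of a helper that builds such a part from raw bytes. The name toInlinePart is hypothetical (it is not part of the SDK), and the sketch assumes a Node.js runtime where the global Buffer is available.

```javascript
// Hypothetical helper, not part of @google/generative-ai.
// Wraps raw bytes in the { inlineData: { mimeType, data } } part shape.
function toInlinePart(bytes, mimeType) {
  return {
    inlineData: {
      mimeType,
      // Buffer.from accepts a Buffer, a Uint8Array, or an ArrayBuffer.
      data: Buffer.from(bytes).toString("base64"),
    },
  };
}

// The first three bytes of a JPEG file (FF D8 FF) encode to "/9j/".
const examplePart = toInlinePart(new Uint8Array([0xff, 0xd8, 0xff]), "image/jpeg");
console.log(examplePart.inlineData.data); // "/9j/"
```

A part built this way can be placed in the array passed to model.generateContent alongside the text prompt.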

Use Cases

  • Image analysis and description
  • Visual question answering
  • Document analysis with images
  • Scene understanding and object detection
  • Visual reasoning tasks

Response Format

The model returns text that draws on both the prompt and the image. Calling result.response.text() returns it as a string containing the analysis or the answer to your query.

Audio Transcription

Gemini can also process audio files to provide transcriptions and analysis. Here's how to work with audio:

import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager, FileState } from "@google/generative-ai/server";

// Initialize the file manager and Gemini
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY);
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const model = genAI.getGenerativeModel({ model: "gemini-2.0-flash" });

// Upload the audio file
const uploadResult = await fileManager.uploadFile(
  "path/to/audio.mp3",
  {
    mimeType: "audio/mp3",
    displayName: "Audio sample",
  }
);

// Wait for file processing
let file = await fileManager.getFile(uploadResult.file.name);
while (file.state === FileState.PROCESSING) {
  await new Promise(resolve => setTimeout(resolve, 10000));
  file = await fileManager.getFile(uploadResult.file.name);
}

if (file.state === FileState.FAILED) {
  throw new Error("Audio processing failed.");
}

// Generate transcription
const result = await model.generateContent([
  "Please provide a detailed transcription of this audio.",
  {
    fileData: {
      fileUri: uploadResult.file.uri,
      mimeType: uploadResult.file.mimeType,
    },
  },
]);

console.log(result.response.text());
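
The polling loop above runs until the file leaves the processing state, with no upper bound on the wait. One possible variant adds a deadline, sketched below. The helper name waitForFile and its options are hypothetical, and the sketch assumes that the processing state compares equal to the string "PROCESSING". The getFile argument is any async function that returns an object with a state field, for example (name) => fileManager.getFile(name).

```javascript
// Hypothetical helper, not part of the SDK.
// Polls getFile(name) until the state is no longer "PROCESSING",
// and throws if the deadline passes first.
async function waitForFile(getFile, name, { intervalMs = 10000, timeoutMs = 600000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  let file = await getFile(name);
  while (file.state === "PROCESSING") {
    if (Date.now() >= deadline) {
      throw new Error(`Timed out waiting for ${name}`);
    }
    // Sleep before asking again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
    file = await getFile(name);
  }
  // The caller still needs to check for a failed state.
  return file;
}
```

With the file manager from the example, a call might look like waitForFile((name) => fileManager.getFile(name), uploadResult.file.name), followed by the same failure check as before.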

Supported Audio Formats

Gemini supports the following audio formats:

  • WAV (audio/wav)
  • MP3 (audio/mp3)
  • AIFF (audio/aiff)
  • AAC (audio/aac)
  • OGG Vorbis (audio/ogg)
  • FLAC (audio/flac)

Technical Notes

  • Each second of audio is represented as 32 tokens
  • Maximum supported length is 9.5 hours per prompt
  • Audio is downsampled to 16 Kbps
  • Multiple channels are combined into a single channel
  • English-language speech is supported
  • Non-speech components (like music, ambient sounds) can be recognized
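
The token rate and the length limit above allow a rough estimate of how many tokens an audio clip will consume before it is sent. The sketch below is illustrative only: estimateAudioTokens is a hypothetical name, and rounding up to a whole token is an assumption.

```javascript
const TOKENS_PER_SECOND = 32;            // from the notes above
const MAX_AUDIO_SECONDS = 9.5 * 60 * 60; // 9.5 hours = 34,200 seconds

// Hypothetical helper: estimates the token count for a clip of the
// given duration, rejecting clips longer than the supported maximum.
function estimateAudioTokens(durationSeconds) {
  if (durationSeconds < 0 || durationSeconds > MAX_AUDIO_SECONDS) {
    throw new RangeError(`Duration must be between 0 and ${MAX_AUDIO_SECONDS} seconds`);
  }
  // Rounding up is an assumption; partial seconds may be counted differently.
  return Math.ceil(durationSeconds * TOKENS_PER_SECOND);
}

console.log(estimateAudioTokens(60)); // 1920 (one minute of audio)
```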