Multi-modal prompting: images, audio, structured inputs

Images, audio, video, and structured inputs all route through the same API now. The prompt patterns that work for text mostly transfer, but each modality has its own gotchas.

Vision: send images, get analysis or text output. GPT-4o, Claude Opus/Sonnet, Gemini — all handle images as first-class inputs.
Audio: speech-to-text for dictation; real-time voice agents (GPT-4o, Gemini Live).
Structured inputs: tables, PDFs with OCR, web pages — the model treats structured data as part of the prompt.
Video: frame-by-frame understanding (Gemini's video support is strongest). Emerging, slower, expensive.

Vision prompting patterns that work

Lead with the task, not the image. "Extract the total from this invoice" before the image attachment, not after.
Be explicit about what to extract. "The amount in the 'Total Due' row, in USD, as a number" beats "the total."
For multi-image prompts, label them. "Image 1: ... / Image 2: ..." — the model is better at cross-referencing when images have names.
Send the full image, then a zoomed crop. For dense images (screenshots, whiteboards), the model sometimes misses detail at a single resolution. Crop to the region of interest and send both the full image and the crop.
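The patterns above can be sketched as a small payload builder. This is a minimal illustration, not any provider's actual request format: the content-block shape (`type`, `text`, `data`) and both function names are assumptions, so map them onto whatever your API expects. It puts the task first, labels each image, and computes a padded crop box for the zoomed view.

```python
from typing import Any


def build_vision_prompt(task: str, images: dict[str, str]) -> list[dict[str, Any]]:
    """Return content blocks: the task first, then a text label before each image.

    `images` maps a human-readable label to base64 image data
    (a hypothetical input shape chosen for this sketch).
    """
    blocks: list[dict[str, Any]] = [{"type": "text", "text": task}]
    for index, (label, b64_data) in enumerate(images.items(), start=1):
        blocks.append({"type": "text", "text": f"Image {index}: {label}"})
        blocks.append({"type": "image", "data": b64_data})
    return blocks


def padded_crop_box(
    region: tuple[int, int, int, int],
    image_size: tuple[int, int],
    pad: float = 0.1,
) -> tuple[int, int, int, int]:
    """Expand a (left, top, right, bottom) region by `pad` and clamp it to the image."""
    left, top, right, bottom = region
    width, height = image_size
    dx = int((right - left) * pad)
    dy = int((bottom - top) * pad)
    return (
        max(0, left - dx),
        max(0, top - dy),
        min(width, right + dx),
        min(height, bottom + dy),
    )
```

The crop box can be passed to any image library's crop call. Sending the full image as "Image 1" and the crop as "Image 2" gives the model both context and detail.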

Audio prompting

Transcription: ship fast with Whisper or Deepgram. Both are well past "good enough" for most use cases.
Voice conversation (real-time): plan for much higher latency variance than text. Design the UX to tolerate 200ms-2s latency windows gracefully.
Speaker diarization (who said what in a multi-speaker recording) is harder than it looks; use specialized APIs, not a general LLM.
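One way to tolerate the latency window described above is to play a short filler cue when the reply is slow and give up cleanly past an upper bound. The sketch below uses only `asyncio`. The function name, the callables `get_reply` and `play_filler`, and the default thresholds (taken from the 200ms-2s range in the text) are illustrative assumptions, not part of any voice SDK.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def reply_with_filler(
    get_reply: Callable[[], Awaitable[str]],
    play_filler: Callable[[], Awaitable[None]],
    filler_after_s: float = 0.2,
    give_up_after_s: float = 2.0,
) -> Optional[str]:
    """Return the reply, playing a filler cue if it takes longer than `filler_after_s`.

    Returns None if no reply arrives within `give_up_after_s` overall.
    """
    task = asyncio.ensure_future(get_reply())
    done, _pending = await asyncio.wait({task}, timeout=filler_after_s)
    if done:
        return task.result()

    # The reply is slow: acknowledge the user, then wait for the remaining budget.
    await play_filler()
    try:
        return await asyncio.wait_for(task, timeout=give_up_after_s - filler_after_s)
    except asyncio.TimeoutError:
        return None  # caller decides how to apologize or retry
```

A `None` result is the signal to fall back, for example by asking the user to repeat the request.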

Structured-input patterns

When you paste a table, a PDF, or a long doc:

Mark the document boundaries explicitly, for example <document name="Q3 Report"> ... </document>. The tags show the model where the document starts and ends, so it can tell the document apart from your instructions.
Summarize metadata up top. "This is a 14-page quarterly financial report in PDF form, exported to markdown. Expect tables and footnotes."
Ask specific questions. "What was the YoY revenue growth?" beats "Summarize this report."
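The three steps above can be combined into one prompt builder. This is a minimal sketch: the function name `wrap_document` and the ordering (metadata, tagged document, question) are assumptions made for illustration. Only the `<document name="...">` tag comes from the text.

```python
import html


def wrap_document(name: str, metadata: str, body: str, question: str) -> str:
    """Build a prompt: metadata summary, the tagged document, then one specific question."""
    safe_name = html.escape(name, quote=True)  # keep quotes in the name from breaking the tag
    return (
        f"{metadata}\n\n"
        f'<document name="{safe_name}">\n{body.strip()}\n</document>\n\n'
        f"{question}"
    )


prompt = wrap_document(
    name="Q3 Report",
    metadata=(
        "This is a 14-page quarterly financial report in PDF form, "
        "exported to markdown. Expect tables and footnotes."
    ),
    body="| Quarter | Revenue |\n| --- | --- |\n| Q3 | (table rows go here) |",
    question="What was the YoY revenue growth?",
)
```

Placing the question after the closing tag keeps the instruction clearly outside the document.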

Cost and latency

Vision is cheap per call but adds tokens fast. A single high-res image is ~1,000-2,000 input tokens. Video costs 5-10× that for each second of footage. Audio transcription is priced separately from LLM tokens.

Budget accordingly. Real-time voice agents can rack up 10-20× the per-session cost of chat.
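A rough budget can be worked out from the figures above. The sketch below is back-of-envelope arithmetic only. The token ranges come from the text, while the function names and the per-million-token price in the usage lines are placeholders; substitute your provider's actual rates.

```python
IMAGE_TOKENS = (1_000, 2_000)  # low/high input tokens per high-res image (from the text)
VIDEO_MULTIPLIER = (5, 10)     # low/high multiple of image tokens per second of video


def estimate_input_tokens(num_images: int = 0, video_seconds: float = 0.0) -> tuple[int, int]:
    """Return a (low, high) estimate of input tokens for images plus video."""
    low = num_images * IMAGE_TOKENS[0] + video_seconds * IMAGE_TOKENS[0] * VIDEO_MULTIPLIER[0]
    high = num_images * IMAGE_TOKENS[1] + video_seconds * IMAGE_TOKENS[1] * VIDEO_MULTIPLIER[1]
    return int(low), int(high)


def estimate_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count into dollars at a given (assumed) input price."""
    return tokens / 1_000_000 * usd_per_million_tokens


low, high = estimate_input_tokens(num_images=3, video_seconds=10)
worst_case_usd = estimate_cost_usd(high, usd_per_million_tokens=3.0)  # placeholder price
```

Budgeting against the high end of the range is the safer default, since image resolution and video frame sampling vary by provider.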

What breaks

Small text in images. OCR via vision models is decent but not bulletproof for fonts below ~8px. Use dedicated OCR for extraction of dense text (receipts, forms).
Charts and dashboards. Models often misread quantitative charts (getting the bar heights wrong). For any chart-based decision, either supply the underlying data as text, or double-check the model's reading.
Occluded or rotated images. Models are surprisingly brittle to orientation. Rotate images upright before sending.
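For the chart case, double-checking the model's reading can be automated whenever the underlying data is available. The sketch below compares values the model read off a chart against the source numbers and reports anything missing or outside a tolerance. The function name and the 2% default tolerance are assumptions chosen for illustration.

```python
import math
from typing import Optional


def mismatched_readings(
    model_values: dict[str, float],
    source_values: dict[str, float],
    rel_tol: float = 0.02,
    abs_tol: float = 1e-9,
) -> dict[str, tuple[Optional[float], float]]:
    """Return {label: (model_value, source_value)} for readings that are missing or off."""
    mismatches: dict[str, tuple[Optional[float], float]] = {}
    for label, expected in source_values.items():
        got = model_values.get(label)
        if got is None or not math.isclose(got, expected, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches[label] = (got, expected)
    return mismatches


# Example values are made up for the demonstration.
source = {"Q1": 120.0, "Q2": 150.0, "Q3": 90.0}
model_reading = {"Q1": 121.0, "Q2": 135.0}
problems = mismatched_readings(model_reading, source)
```

An empty result means the reading matches the data within tolerance. Any entry is a reason to fall back to the source numbers.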

