Multi-modal prompting: images, audio, structured inputs
How to prompt vision and audio models without losing the thread.
Image, audio, video, and structured inputs all route through the same API now. The prompt patterns that work for text mostly transfer — but there are specific gotchas per modality.
What "multi-modal" covers in 2026
- Vision: send images, get analysis or text output. GPT-4o, Claude Opus/Sonnet, Gemini — all handle images as first-class inputs.
- Audio: speech-to-text for dictation; real-time voice agents (GPT-4o, Gemini Live).
- Structured inputs: tables, PDFs with OCR, web pages — the model treats structured data as part of the prompt.
- Video: frame-by-frame understanding (Gemini's video support is strongest). Emerging, slower, expensive.
Vision prompting patterns that work
- Lead with the task, not the image. "Extract the total from this invoice" before the image attachment, not after.
- Be explicit about what to extract. "The amount in the 'Total Due' row, in USD, as a number" beats "the total."
- For multi-image prompts, label them. "Image 1: ... / Image 2: ..." — the model is better at cross-referencing when images have names.
- Zoom first, zoom second. For dense images (screenshots, whiteboards), the model sometimes misses detail at a single resolution. Crop to the region of interest, then send both the full image and the crop.
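The patterns above can be combined into one request body. This is a minimal sketch, not any provider's real schema: the function name `build_vision_prompt`, the dict keys (`type`, `text`, `data_base64`), and the placeholder image bytes are all hypothetical, so map the parts onto whatever message format your API expects.

```python
import base64


def build_vision_prompt(task, images):
    """Return content parts with the task first, then labeled images.

    `images` is a list of (label, raw_bytes) pairs. The dict layout is a
    hypothetical, provider-neutral shape, not a real API schema.
    """
    # Lead with the task, not the image.
    parts = [{"type": "text", "text": task}]
    for index, (label, raw) in enumerate(images, start=1):
        # Name each image so the model can cross-reference them.
        parts.append({"type": "text", "text": f"Image {index}: {label}"})
        parts.append({
            "type": "image",
            "data_base64": base64.b64encode(raw).decode("ascii"),
        })
    return parts


# Placeholder bytes stand in for real image files.
parts = build_vision_prompt(
    "Extract the amount in the 'Total Due' row, in USD, as a number.",
    [
        ("full invoice scan", b"<full image bytes>"),
        ("crop of the totals table", b"<cropped image bytes>"),
    ],
)
```

Sending the full scan and the crop as two labeled images covers the "zoom" pattern: the model gets the context and the detail in one call.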
Audio prompting
- Transcription: ship fast with Whisper or Deepgram. Both are well past "good enough" for most use cases.
- Voice conversation (real-time): plan for much higher latency variance than text. Design the UX to tolerate 200ms-2s latency windows gracefully.
- Speaker diarization (who said what in a multi-speaker recording) is harder than it looks; use specialized APIs, not a general LLM.
Structured-input patterns
When you paste a table, a PDF, or a long doc:
- Mark the document boundaries explicitly. <document name="Q3 Report"> ... </document>. The model uses the tag as a mental boundary.
- Summarize metadata up top. "This is a 14-page quarterly financial report in PDF form, exported to markdown. Expect tables and footnotes."
- Ask specific questions. "What was the YoY revenue growth?" beats "Summarize this report."
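One way to assemble such a prompt is sketched below. The helper name `wrap_document` and its parameters are illustrative assumptions; only the `<document name="...">` tag shape comes from the lesson. The metadata summary goes first, the tagged document second, and the specific question last.

```python
import html


def wrap_document(name, body, metadata, question):
    """Build a prompt: metadata summary, tagged document, then the question."""
    # Escape the name so a stray quote cannot break the tag attribute.
    safe_name = html.escape(name, quote=True)
    return (
        f"{metadata}\n\n"
        f'<document name="{safe_name}">\n{body}\n</document>\n\n'
        f"{question}"
    )


prompt = wrap_document(
    name="Q3 Report",
    body="(report text exported to markdown goes here)",
    metadata=(
        "This is a 14-page quarterly financial report in PDF form, "
        "exported to markdown. Expect tables and footnotes."
    ),
    question="What was the YoY revenue growth?",
)
print(prompt)
```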
Cost and latency
Vision is cheap per call but adds tokens fast. A single high-res image is ~1,000-2,000 input tokens. Video is 5-10× that per second. Audio transcription is priced separately from LLM tokens.
Budget accordingly. Real-time voice agents can rack up 10-20× the per-session cost of chat.
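For a rough budget, the figures above can be turned into a low/high token estimate. This is a back-of-the-envelope sketch that uses only the lesson's approximate ranges (1,000-2,000 tokens per high-res image, video at 5-10× that per second). It is not any provider's actual pricing or tokenization, and the function name is hypothetical.

```python
# Rough ranges taken from the lesson text, not from a provider's price sheet.
IMAGE_TOKENS = (1_000, 2_000)   # (low, high) input tokens per high-res image
VIDEO_MULTIPLIER = (5, 10)      # one second of video costs 5-10x one image


def estimate_input_tokens(num_images=0, video_seconds=0):
    """Return a (low, high) estimate of input tokens for images and video."""
    low = (num_images * IMAGE_TOKENS[0]
           + video_seconds * VIDEO_MULTIPLIER[0] * IMAGE_TOKENS[0])
    high = (num_images * IMAGE_TOKENS[1]
            + video_seconds * VIDEO_MULTIPLIER[1] * IMAGE_TOKENS[1])
    return low, high


print(estimate_input_tokens(num_images=3))       # (3000, 6000)
print(estimate_input_tokens(video_seconds=10))   # (50000, 200000)
```

The spread between the low and high estimates is wide, which is the point: measure real token counts from your provider's usage reports before committing to a budget.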
What breaks
- Small text in images. OCR via vision models is decent but not bulletproof for fonts below ~8px. Use dedicated OCR for extraction of dense text (receipts, forms).
- Charts and dashboards. Models often misread quantitative charts (getting the bar heights wrong). For any chart-based decision, either supply the underlying data as text, or double-check the model's reading.
- Occluded or rotated images. Models are surprisingly brittle to orientation. Rotate images upright before sending.
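When the underlying chart data is available, double-checking the model's reading can be automated. The sketch below compares values the model reported against the source numbers and flags anything outside a tolerance. The function name, the 2% tolerance, and the sample figures are all made up for illustration.

```python
import math


def find_misread_values(model_reading, source_data, rel_tol=0.02):
    """Return {key: (model_value, true_value)} for missing or off values."""
    mismatches = {}
    for key, expected in source_data.items():
        got = model_reading.get(key)
        # Flag values the model skipped or read outside the tolerance.
        if got is None or not math.isclose(got, expected, rel_tol=rel_tol):
            mismatches[key] = (got, expected)
    return mismatches


# Illustrative numbers only.
source = {"Q1": 120.0, "Q2": 135.0, "Q3": 150.0}
reading = {"Q1": 120.0, "Q2": 153.0, "Q3": 149.0}
print(find_misread_values(reading, source))  # {'Q2': (153.0, 135.0)}
```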