Multimodal Prompting
Combine text, images, and code in prompts effectively to leverage AI vision and cross-modal reasoning.
The Problem
Many developers use only text prompts, even though modern multimodal models can process images, screenshots, diagrams, and other visual context alongside text. A screenshot of a bug is worth a thousand words of description, and a mockup image often produces better UI code than paragraphs of layout specification. Skipping multimodal input leaves much of the model's capability unused.
The Prompt
I am providing both text and visual input. Analyze them together.
TEXT CONTEXT:
[describe what the image shows and what you need]
IMAGE:
[attach a UI screenshot, mockup, diagram, or error capture]
TASK:
[what you want the AI to do with both inputs]
VISUAL ANALYSIS INSTRUCTIONS:
- Describe what you see in the image before acting on it
- Reference specific visual elements (top-left button, error message in red, etc.)
- If the image quality is poor or details are unclear, state what you cannot determine
- Cross-reference visual elements with the text context
OUTPUT FORMAT:
[specify what you want back: code, analysis, description, fixes]
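The template above can also be filled in programmatically. The following is a minimal sketch, not a definitive implementation: the names `TEMPLATE`, `build_prompt`, and `encode_image` are hypothetical, and the dictionary returned by `encode_image` is an assumed shape rather than any particular provider's API. It uses only the Python standard library and leaves the actual upload of the image to whatever client you use.

```python
import base64
import mimetypes

# The prompt from this section, with the bracketed slots turned into fields.
# The IMAGE slot is a fixed note because the image travels as a separate part.
TEMPLATE = """I am providing both text and visual input. Analyze them together.

TEXT CONTEXT:
{text_context}

IMAGE:
(attached as a separate image part)

TASK:
{task}

VISUAL ANALYSIS INSTRUCTIONS:
- Describe what you see in the image before acting on it
- Reference specific visual elements (top-left button, error message in red, etc.)
- If the image quality is poor or details are unclear, state what you cannot determine
- Cross-reference visual elements with the text context

OUTPUT FORMAT:
{output_format}"""


def build_prompt(text_context: str, task: str, output_format: str) -> str:
    """Fill the template's slots with concrete text (hypothetical helper)."""
    return TEMPLATE.format(
        text_context=text_context.strip(),
        task=task.strip(),
        output_format=output_format.strip(),
    )


def encode_image(image_bytes: bytes, filename: str) -> dict:
    """Package raw image bytes as base64 plus a media type guessed from the name.

    The returned dict layout is an assumption for illustration only.
    """
    media_type, _ = mimetypes.guess_type(filename)
    if media_type is None or not media_type.startswith("image/"):
        raise ValueError(f"not a recognised image file name: {filename}")
    return {
        "media_type": media_type,
        "data": base64.b64encode(image_bytes).decode("ascii"),
    }
```

A caller would pass the string from `build_prompt` as the text part of the request and the result of `encode_image` as the image part, adapting the field names to the client library in use.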
Example Output
VISUAL ANALYSIS:
The screenshot shows a React dashboard with a data table. I can see:
- A table with 5 columns (Name, Email, Role, Status, Actions)
- Row 3 shows a misaligned "Actions" dropdown that overflows outside the table container
- The status badges use inconsistent colors (row 1: green; row 4: a darker shade of green)
- The header row has no sort indicators despite the cursor showing a pointer on "Name"
ISSUES IDENTIFIED:
1. Dropdown overflow: the parent cell needs `position: relative`, and the dropdown needs `position: absolute` with `right: 0` and `z-index: 10`
2. Inconsistent badge colors: the darker green (#16a34a) on row 4 should match the lighter green (#22c55e) used on row 1
3. Missing sort affordance: add chevron icons to sortable column headers
CODE FIX for issue 1:
.table-cell-actions {
  position: relative;
}
.actions-dropdown {
  position: absolute;
  right: 0;
  z-index: 10;
}
When to Use
Use multimodal prompting when debugging UI issues from screenshots, implementing designs from mockups, analyzing error screens, reviewing charts or diagrams, or any task where visual context adds information that text alone cannot capture. It is particularly powerful for frontend development, design-to-code workflows, and visual bug reporting.
Pro Tips
- Ask the model to describe the image first — this confirms it sees what you see and catches misinterpretations before they cascade into wrong solutions.
- Annotate your screenshots — draw arrows or circles on the problem areas before uploading; this focuses the model’s attention on what matters.
- Combine mockup + existing code — provide both the design image and your current component code so the model can generate the diff, not a full rewrite.
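- Use screenshots for error debugging: a screenshot of a terminal error, browser console, or deployment log can be faster to share than copied text. When exact strings matter (stack traces, identifiers), paste the text as well, because small text in an image can be misread.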
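The tips above on describing the image first, annotating it, and pairing a mockup with existing code can be folded into one text prompt that accompanies the image. This is a rough sketch under stated assumptions: `build_design_diff_prompt` and its parameters are hypothetical names, and the wording of the instructions is only one possible phrasing.

```python
def build_design_diff_prompt(
    component_code: str, filename: str, annotations: list[str]
) -> str:
    """Compose the text sent alongside a mockup image (hypothetical helper).

    It asks for a description first, lists any annotations drawn on the
    image, includes the current code, and requests a diff, not a rewrite.
    """
    notes = "\n".join(f"- {note}" for note in annotations) or "- (none)"
    return (
        "The attached image is the target design. Before changing anything, "
        "describe what you see in it.\n\n"
        f"ANNOTATIONS ON THE IMAGE:\n{notes}\n\n"
        f"CURRENT CODE ({filename}):\n{component_code.rstrip()}\n\n"
        "TASK:\nReturn a unified diff against the current code that makes the "
        "component match the design. Do not rewrite unchanged lines."
    )
```

Asking for a diff keeps the response small and makes it easier to review against the component you already have.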