1Cademy - Multimodal Chain-of-Thought Reasoning

Learn Before

Chain-of-Thought Prompting

Concept

Multimodal Chain-of-Thought Reasoning

Multimodal chain-of-thought reasoning extends step-by-step prompting to inputs that combine text and images. Language models given both modalities can reason more accurately than when they rely on text alone.
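
As a rough illustration of the idea above, the sketch below assembles a request that pairs an image with a text question and a step-by-step instruction. The function name `build_multimodal_cot_prompt`, the message layout, and the sample question are all hypothetical and do not correspond to any particular model's API; a real system would define its own request format.

```python
import base64


def build_multimodal_cot_prompt(question, image_bytes):
    """Assemble a hypothetical multimodal chain-of-thought request.

    The list-of-parts layout is an assumption made for illustration only.
    """
    # Encode the raw image bytes so they can travel inside a text payload.
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    # Ask the model to reason over both modalities before answering.
    instruction = (
        question
        + "\nLet's think step by step, describing what the image shows"
        + " before giving the final answer."
    )
    return [
        {"type": "image", "data": image_b64},
        {"type": "text", "text": instruction},
    ]


# Placeholder bytes stand in for a real image file.
parts = build_multimodal_cot_prompt(
    "How many apples are on the table?", b"placeholder-image-bytes"
)
```

The image part and the text part are sent together, so the model can ground each reasoning step in what it sees rather than in the question wording alone.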

Updated 2026-05-15

References

Dive into Deep Learning

Tags

D2L

Dive into Deep Learning @ D2L

Example of a Model Initiating Chain-of-Thought Reasoning
A user is trying to solve a word problem using a large language model. They use two different prompts and receive two different responses.

Prompt 1: "A farmer has 5 pens, and each pen holds 8 chickens. The farmer buys 10 more chickens. How many chickens does the farmer have in total?" Response 1: "The farmer has 50 chickens in total."
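
The arithmetic behind Response 1 can be written out as the intermediate steps a chain-of-thought answer would state. This is a small sketch that assumes every pen is full; the variable names are illustrative.

```python
pens = 5
chickens_per_pen = 8
chickens_bought = 10

# Step 1: chickens already in the pens (assumes each pen is full).
chickens_in_pens = pens * chickens_per_pen

# Step 2: add the newly bought chickens.
total_chickens = chickens_in_pens + chickens_bought

print(f"{pens} x {chickens_per_pen} = {chickens_in_pens}")
print(f"{chickens_in_pens} + {chickens_bought} = {total_chickens}")
```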

Prompt 2: "A farmer has 5 pens, and each pen holds 8 chickens. The farmer buys 10 more chickens. How many chickens does the farmer have in total? Let
Modifying a Prompt for Step-by-Step Reasoning
Improving AI Reasoning for a Multi-Step Problem
Improving Chain-of-Thought Reasoning
Zero-Shot Chain-of-Thought Reasoning
Multimodal Chain-of-Thought Reasoning
