Case Study

Designing an End-to-End Image Captioning System

Case context: You are tasked with building a machine learning system that takes a raw image as input and outputs a descriptive sentence in English. Your team suggests building a multi-stage pipeline, but you decide to apply end-to-end deep learning.

Question: Decide what specific type of data you must collect to train this end-to-end model, and explain the nature of the output the model will learn to produce directly.

Sample answer: To train this end-to-end model, I must collect labeled input-output pairs: raw images, each matched with a sentence that describes it. The model then learns to map an image directly to a rich output, a complete sentence, rather than being limited to predicting a single number.

Key points:

  • Collect labeled input-output pairs: raw images matched with their descriptive sentences.
  • The model directly learns a rich output (a sentence).
  • The output is much more complex than a single number.
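The key points can be made concrete with a minimal sketch of what one labeled pair might look like and how a caption becomes a training target. Everything here is an illustrative assumption rather than part of the case study: the `CaptionExample` class, the tiny pixel grids standing in for real images, the whitespace tokenizer, and the `<bos>`/`<eos>` markers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CaptionExample:
    """One labeled input-output pair (hypothetical structure)."""
    image: List[List[int]]  # raw pixels; a tiny grayscale grid stands in for a real image
    caption: str            # the descriptive sentence the model should learn to output


def build_vocab(examples: List[CaptionExample]) -> Dict[str, int]:
    """Assign an integer id to every word seen in the captions."""
    vocab = {"<bos>": 0, "<eos>": 1}  # assumed start/end-of-sentence markers
    for ex in examples:
        for word in ex.caption.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode_caption(caption: str, vocab: Dict[str, int]) -> List[int]:
    """Turn a sentence into the sequence of token ids used as the target."""
    words = caption.lower().split()
    return [vocab["<bos>"]] + [vocab[w] for w in words] + [vocab["<eos>"]]


examples = [
    CaptionExample(image=[[0, 255], [255, 0]], caption="A dog runs on grass"),
    CaptionExample(image=[[10, 20], [30, 40]], caption="A cat sleeps on a sofa"),
]

vocab = build_vocab(examples)
targets = [encode_caption(ex.caption, vocab) for ex in examples]

for ex, target in zip(examples, targets):
    # The target is a variable-length sequence, not a single number.
    print(ex.caption, "->", target)
```

Note that the two targets have different lengths. That is one way to see why a sentence is a richer output than a single number: the model has to produce a whole sequence, and the labeled pairs are what tell it which sequence goes with which image.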

Rubric: The learner must identify the need for image-sentence labeled pairs and state that the model will learn to output a complex data structure (a sentence), demonstrating understanding of rich outputs.


Updated 2026-05-27


Tags

D2L

Dive into Deep Learning @ D2L

Machine Learning

Deep Learning

Supervised Learning

Data Science

Machine Learning Strategy

Machine Learning Yearning @ DeepLearning.AI