1Cademy - Designing an End-to-End Image Captioning System

Learn Before

End-to-End Learning

Case Study

Designing an End-to-End Image Captioning System

Case context: You are tasked with building a machine learning system that takes a raw image as input and outputs a descriptive sentence in English. Your team suggests building a multi-stage pipeline, but you decide to apply end-to-end deep learning.

Question: Decide what specific type of data you must collect to train this end-to-end model, and explain the nature of the output the model will learn to produce directly.

Sample answer: To train this end-to-end model, I must collect a dataset of the right labeled input-output pairs, which in this case means raw images matched with their corresponding descriptive sentences. The model will then directly learn to produce a rich output—a complete sentence—rather than being limited to predicting a single number.

Key points:

Must collect correct labeled input-output pairs.
The model directly learns a rich output (a sentence).
The output is much more complex than a single number.

Rubric: The learner must identify the need for image-sentence labeled pairs and state that the model will learn to output a complex data structure (a sentence), demonstrating understanding of rich outputs.

0

1

Updated 2026-05-27

Contributors are:

Who are from:

References

Learn Before

Related