1Cademy - Data Sampling Notation from a Distribution

Learn Before

Data-Generating Process and Data-Generating Distribution (in Machine Learning)

Formula

Data Sampling Notation from a Distribution

The notation $(x, y) \sim D$ specifies that a data sample, represented by the tuple $(x, y)$, is drawn from a probability distribution or dataset denoted by $D$. The tilde symbol ($\sim$) signifies 'is distributed as' or 'is drawn from'.
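As a rough illustration of what drawing $(x, y) \sim D$ can look like in practice, the sketch below treats $D$ as a small in-memory dataset and draws one pair from it. The toy data, the helper name `draw_sample`, and the choice of uniform sampling are assumptions made for this example only; the notation itself does not say how $D$ is represented or which sampling scheme is used.

```python
import random

# Hypothetical toy dataset standing in for D: each entry is an (x, y) pair,
# where x is an input (an image identifier here) and y is its label.
D = [
    ("img_001", "cat"),
    ("img_002", "dog"),
    ("img_003", "cat"),
    ("img_004", "dog"),
]


def draw_sample(dataset, rng):
    """Draw one (x, y) pair from the dataset.

    Assumption: every pair is equally likely, i.e. we sample from the
    empirical (uniform) distribution over the dataset.
    """
    return rng.choice(dataset)


rng = random.Random(0)    # seeded so the draw is reproducible
x, y = draw_sample(D, rng)  # corresponds to (x, y) ~ D
print(x, y)
```

The input and its label are drawn together as one tuple, which is what the notation expresses: the pair $(x, y)$ is a single sample from $D$, not two independent draws.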

Updated 2026-06-25


References

Reference of Foundations of Large Language Models Course

Learn After

A machine learning engineer is preparing a dataset for a supervised image classification task. Each data point consists of an image and its corresponding correct label (e.g., 'cat', 'dog'). The entire collection of these image-label pairs forms the dataset. Which of the following notations correctly expresses the action of drawing a single, complete data sample (represented by an image x and its label y) from the overall data distribution D?
Correcting Data Sampling Notation
A research team is training a machine learning model to translate text from one language to another. Their dataset, denoted as D, consists of a large collection of sentence pairs, where each pair contains a sentence in the source language and its correct translation in the target language. Which notation accurately represents the process of drawing a single, complete training example (a source sentence x and its corresponding target sentence y) from this dataset?
