Learn Before
  • Division of dataset in supervised statistical learning

Data-Generating Process and Data-Generating Distribution (in Machine Learning)

The training and test data are generated by a probability distribution over datasets called the data-generating process. We typically make two assumptions, collectively known as the i.i.d. assumptions: the examples in each dataset are independent of each other, and the training set and test set are identically distributed, that is, drawn from the same probability distribution. These assumptions let us describe the data-generating process with a probability distribution over a single example.

The same distribution is then used to generate every training example and every test example. We call this shared underlying distribution the data-generating distribution, denoted $p_{data}$.
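
One way to write this down, as a sketch: assume a dataset of m examples x^{(1)}, ..., x^{(m)} (the symbols m, x^{(i)} and \mathcal{D} are notation introduced here for illustration, not taken from the text above). Independence lets the probability of the whole dataset factor into a product, and identical distribution means every factor uses the same data-generating distribution:

```latex
p(\mathcal{D})
  = p\big(x^{(1)}, \dots, x^{(m)}\big)
  = \prod_{i=1}^{m} p_{data}\big(x^{(i)}\big)
```

The same factorization is assumed to hold for the training set and for the test set, which is why a distribution over a single example is enough to describe both.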

This probabilistic framework, together with the independence and identical-distribution assumptions, enables us to study the relationship between training error and test error mathematically.
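
The relationship can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the data-generating distribution (x uniform on [-1, 1], y = 2x plus Gaussian noise), the sample sizes, and all function names are hypothetical choices, not part of the original note. Both datasets are drawn independently from the same distribution, a one-parameter model is fit on the training set, and the two errors are compared.

```python
import random


def sample_example(rng):
    """Draw one (x, y) pair from a hypothetical data-generating distribution."""
    x = rng.uniform(-1.0, 1.0)
    y = 2.0 * x + rng.gauss(0.0, 0.1)  # assumed: y = 2x + Gaussian noise
    return x, y


def sample_dataset(rng, n):
    """Every example is an independent draw from the same distribution (i.i.d.)."""
    return [sample_example(rng) for _ in range(n)]


def fit_slope(data):
    """Least-squares slope w for the model y ~ w * x (no intercept)."""
    numerator = sum(x * y for x, y in data)
    denominator = sum(x * x for x, _ in data)
    return numerator / denominator


def mean_squared_error(w, data):
    """Average squared prediction error of the model y ~ w * x on a dataset."""
    return sum((y - w * x) ** 2 for x, y in data) / len(data)


rng = random.Random(0)            # fixed seed so the run is repeatable
train = sample_dataset(rng, 1000)
test = sample_dataset(rng, 1000)  # same distribution, fresh independent draws

w = fit_slope(train)
train_error = mean_squared_error(w, train)
test_error = mean_squared_error(w, test)

print(f"fitted slope:   {w:.3f}")
print(f"training error: {train_error:.4f}")
print(f"test error:     {test_error:.4f}")
```

Because both sets come from the same distribution, the two errors should land close to each other and close to the noise variance (0.01 in this setup). Changing `sample_example` for the test set only, for instance by shifting the range of x or the slope, is a simple way to see what happens when the identical-distribution assumption is violated.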

Tags

Data Science

Foundations of Large Language Models Course

Computing Sciences

Related
  • Training data

  • Validation (Development) Set

  • Data-Generating Process and Data-Generating Distribution (in Machine Learning)

  • Train Test Split Function

  • Test Data

  • Common Practices of Train/Test Set Arrangements in NLP

Learn After
  • Training Error and Test Error

  • Data Sampling Notation from a Distribution

  • Conditional Probability of Pairwise Preference

  • A team develops a model to predict customer churn using historical data from 2019-2021. The model performs exceptionally well on a portion of this historical data set aside for testing. However, when deployed to predict churn for customers in 2023, its performance is poor. A major new loyalty program was introduced at the beginning of 2023, altering customer retention patterns. Which of the following statements best analyzes the most likely reason for this discrepancy?

  • A data scientist is tasked with building a model to predict real estate prices for an entire metropolitan area. To do this, they must create a training set and a test set. Which of the following data collection and splitting strategies presents the most significant risk of violating the fundamental assumption that both datasets are drawn from the same underlying probability distribution?

  • Evaluating Data Sourcing for a Spam Filter