Learn Before
Division of dataset in supervised statistical learning
Data-Generating Process and Data-Generating Distribution (in Machine Learning)
The training and test data are generated by a probability distribution over datasets called the data-generating process. We typically make a set of assumptions known collectively as the i.i.d. assumptions: the examples in each dataset are independent of each other, and the training set and test set are identically distributed, that is, drawn from the same probability distribution. These assumptions enable us to describe the data-generating process with a probability distribution over a single example.
The same distribution is then used to generate every training example and every test example. We call that shared underlying distribution the data-generating distribution, denoted $p_{\text{data}}$.
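One way to make the single-example description concrete is to write out how the probability of a whole dataset factorizes. The notation here is an assumption for illustration: $p_{\text{data}}$ stands for the data-generating distribution, $m$ for the number of examples, and $x^{(i)}$ for the $i$-th example.

```latex
% Independence lets the joint probability factor into per-example terms;
% identical distribution means every term uses the same p_data.
p\left(x^{(1)}, \dots, x^{(m)}\right) = \prod_{i=1}^{m} p_{\text{data}}\left(x^{(i)}\right)
```

The same factorization applies to the training set and to the test set, since both are assumed to be drawn from $p_{\text{data}}$.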
This probabilistic framework and the assumptions of independence and identical distribution enable us to mathematically study the relationship between training error and test error.
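As a rough illustration, the sketch below draws a training set and a test set independently from one shared toy distribution, fits a line on the training set, and compares the two errors. Everything specific here is an assumption made up for the example: the linear relationship `y = 2x + 1`, the noise level, the sample sizes, and the helper names `sample_dataset`, `fit_line`, and `mse`.

```python
import random


def sample_dataset(n, rng):
    """Draw n independent examples (x, y) from one fixed toy distribution."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        # Hypothetical true relationship plus Gaussian noise (an assumption).
        y = 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)
        data.append((x, y))
    return data


def fit_line(data):
    """Ordinary least squares for y = w * x + b with a single feature."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in data)
    var = sum((x - mean_x) ** 2 for x, _ in data)
    w = cov / var
    b = mean_y - w * mean_x
    return w, b


def mse(data, w, b):
    """Mean squared error of the line (w, b) on a dataset."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


rng = random.Random(0)

# Both sets come from the same sampler, so they are identically distributed,
# and each example is drawn independently of the others.
train = sample_dataset(200, rng)
test = sample_dataset(200, rng)

w, b = fit_line(train)  # parameters are chosen using the training set only
train_error = mse(train, w, b)
test_error = mse(test, w, b)

print(f"w={w:.3f}, b={b:.3f}")
print(f"training error={train_error:.4f}, test error={test_error:.4f}")
```

Because both sets come from the same distribution, the two errors should land close to each other in this toy setting. If the test set were instead drawn from a different sampler, there would be no such reason to expect the training error to say anything about the test error.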
Tags
Data Science
Related
Training data
Validation (Development) Set
Data-Generating Process and Data-Generating Distribution (in Machine Learning)
Train Test Split Function
Test Data
Common Practices of Train/Test Set Arrangements in NLP
Learn After
Training Error and Test Error