- A common split is 80% training set, 10% development set, and 10% test set.
- The test set should be large enough to be representative, since a small test set may be unrepresentative by chance.
- A good common practice is to find the smallest test set that gives us enough statistical power to measure a significant difference between models, since we also want the training set to include as much data as possible.
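
To make the point about test set size concrete, here is a minimal sketch of how the uncertainty of an accuracy estimate shrinks as the test set grows. It assumes each test example is an independent Bernoulli trial and uses the normal approximation for a rough 95% interval; the function name `accuracy_standard_error` and the 0.9 accuracy are illustrative choices, not taken from the notes.

```python
import math

def accuracy_standard_error(accuracy: float, n_test: int) -> float:
    """Standard error of an accuracy estimate, treating each test example
    as an independent Bernoulli trial (an assumption of this sketch)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)

# The same measured accuracy is much less certain on a small test set.
for n in (100, 1_000, 10_000):
    se = accuracy_standard_error(0.9, n)
    print(f"n_test={n:>6}: 0.90 +/- {1.96 * se:.3f} (approx. 95% interval)")
```

If two models differ by less than the interval width at a given test size, that test set is probably too small to tell them apart.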

University of Michigan - Ann Arbor

- Training set: to train the model
- Validation set (development set): to tune the model's hyperparameters
- Test set: to test the performance of the model on unobserved data
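
The three roles above can be sketched with a plain-Python three-way partition. This is only an illustration: the helper `three_way_split`, its default fractions, and the seed are hypothetical, and real projects usually rely on a library routine instead.

```python
import random

def three_way_split(samples, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle a copy of `samples` and cut it into train/dev/test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(train_frac * len(shuffled))
    n_dev = round(dev_frac * len(shuffled))
    train = shuffled[:n_train]               # used to fit the model
    dev = shuffled[n_train:n_train + n_dev]  # used to tune hyperparameters
    test = shuffled[n_train + n_dev:]        # whatever remains is held out for the final evaluation
    return train, dev, test

train, dev, test = three_way_split(range(100))
print(len(train), len(dev), len(test))
```

The three lists are disjoint, so no example used for training or tuning leaks into the final evaluation.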

Division of dataset in supervised statistical learning

A helpful NLP textbook that is still a work in progress
https://web.stanford.edu/~jurafsky/slp3/

Speech and Language Processing (3rd ed. draft) 

The observations that a learning method uses to build a model for estimating the response variable.

Training data

The training and test data are generated by a probability distribution over datasets called the data-generating process. We typically make a set of assumptions known collectively as the i.i.d. assumptions: the examples in each dataset are independent of each other, and the training set and test set are identically distributed, drawn from the same probability distribution as each other. These assumptions enable us to describe the data-generating process with a probability distribution over a single example.

The same distribution is then used to generate every training example and every test example. We call that shared underlying distribution the data-generating distribution, denoted $p_{data}$.

This probabilistic framework and the i.i.d. assumptions enable us to mathematically study the relationship between training error and test error.
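
As a worked illustration (the symbols $m$, $f$, and $L$ are notation introduced here, not in the notes): under the i.i.d. assumptions, the probability of a dataset of $m$ examples factorizes into a product over single examples,

$$p\left(x^{(1)}, \dots, x^{(m)}\right) = \prod_{i=1}^{m} p_{data}\left(x^{(i)}\right).$$

For a model $f$ that is fixed in advance, rather than fitted to the sampled data, with per-example loss $L$, the expected training error therefore equals the expected test error, because both are averages over draws from the same $p_{data}$:

$$\mathbb{E}\left[\frac{1}{m}\sum_{i=1}^{m} L\left(f(x^{(i)}), y^{(i)}\right)\right] = \mathbb{E}_{(x, y) \sim p_{data}}\left[L\left(f(x), y\right)\right].$$

Once $f$ is chosen to minimize the training error, this equality no longer holds in general, and the expected test error is at least as large as the expected training error.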

Data-Generating Process and Data-Generating Distribution (in Machine Learning)

This function randomly shuffles the dataset and splits off a certain percentage of the input samples for use as a training set, then puts the remaining samples into separate variables for use as a test set. In this example, we're using a 75%/25% split of training versus test data, which is the function's default. That's a pretty standard split, and a good rule of thumb when deciding what proportion of the data to hold out for testing.
```
from sklearn.model_selection import train_test_split

# X holds the feature matrix and y the labels; both are assumed to be defined already.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
```
- Unless the optional parameters test_size or train_size are passed in, the function randomly splits the data into 75% training and 25% test
- random_state is an optional parameter that seeds the shuffling, so the same split is reproduced across multiple function calls
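
The two parameters above can be tried out on toy data. This is a small sketch assuming scikit-learn is installed; the 20-sample dataset and the variable names `split_a` and `split_b` are made up for illustration.

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 samples with one feature each and binary labels.
X = [[i] for i in range(20)]
y = [i % 2 for i in range(20)]

# Default behaviour: 25% of the samples are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(len(X_train), len(X_test))

# Explicit 80/20 split; reusing the same random_state reproduces the same partition.
split_a = train_test_split(X, y, test_size=0.2, random_state=0)
split_b = train_test_split(X, y, test_size=0.2, random_state=0)
print(split_a == split_b)
```

Leaving random_state unset gives a different shuffle on each call, which is fine for exploration but makes results harder to compare between runs.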

Train Test Split Function

Common Practices of Train/Test Set Arrangements in NLP

A test dataset is a separate partition of data that is held out from the learning process and used strictly to evaluate a model's performance. Because performing well on the training data does not guarantee success on unseen data, the test set serves as a critical measure of the model's true ability to generalize.

Test Dataset

A validation dataset (or validation set) is a separate partition of data used to assess a machine learning model during development. After a model is trained on the training data, its predictive performance is evaluated on this distinct validation dataset to measure how well it has learned the underlying patterns. This evaluation is what guides hyperparameter tuning: candidate settings are compared on the validation set rather than on the training set, which helps the chosen model generalize to unseen data.
