The `SyntheticRegressionData` class encapsulates the procedural generation of synthetic datasets for linear regression models. Inheriting from a base data module (such as `d2l.DataModule`), its `__init__` constructor takes the true parameters $$\mathbf{w}$$ and $$b$$, alongside hyperparameters like `noise` (defaulting to `0.01`), `num_train`, `num_val`, and `batch_size`. After invoking `save_hyperparameters()`, it constructs the feature matrix $$\mathbf{X}$$ from a normal distribution and computes the label vector $$\mathbf{y}$$ using the formula $$\mathbf{y} = \mathbf{X} \mathbf{w} + b + \boldsymbol{\epsilon}$$. This structure standardizes the dataset preparation for training and validation.
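
The constructor described above can be approximated without any deep learning framework. The following is a minimal stand-in sketch, not the `d2l` implementation: it uses only Python's standard library, replaces `save_hyperparameters()` with plain attribute assignment, and introduces a hypothetical `seed` argument and `split()` helper. The defaults for `num_train`, `num_val`, and `batch_size`, and the parameter values `w=[2.0, -3.4]`, `b=4.2`, are assumptions chosen for illustration.

```python
import random


class SyntheticRegressionData:
    """Stand-in sketch: store hyperparameters, then build X and y = Xw + b + noise."""

    def __init__(self, w, b, noise=0.01, num_train=1000, num_val=1000,
                 batch_size=32, seed=0):
        # Plays the role of save_hyperparameters(): keep every argument as an attribute.
        self.w, self.b, self.noise = list(w), b, noise
        self.num_train, self.num_val, self.batch_size = num_train, num_val, batch_size

        rng = random.Random(seed)  # hypothetical `seed` argument, for reproducibility
        n = num_train + num_val
        # Feature matrix X: n rows, len(w) columns, entries drawn from N(0, 1).
        self.X = [[rng.gauss(0.0, 1.0) for _ in self.w] for _ in range(n)]
        # Labels: y = Xw + b + epsilon, with epsilon ~ N(0, noise^2).
        self.y = [
            sum(wj * xj for wj, xj in zip(self.w, row)) + b + rng.gauss(0.0, noise)
            for row in self.X
        ]

    def split(self):
        """Return (train, val) as (X, y) pairs; the first num_train rows are training data."""
        k = self.num_train
        return (self.X[:k], self.y[:k]), (self.X[k:], self.y[k:])


data = SyntheticRegressionData(w=[2.0, -3.4], b=4.2)
(train_X, train_y), (val_X, val_y) = data.split()
print(len(train_X), len(val_X), len(data.X[0]))  # 1000 1000 2
```

Because the true `w` and `b` are stored on the object, any model trained on `train_X` and `train_y` can later be compared directly against the ground truth.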

SyntheticRegressionData Class

The two-dimensional linear model is among the simplest regression models encountered in machine learning. Its inherent simplicity makes it highly effective for testing the accuracy of regression algorithms, as it allows practitioners to avoid complicating factors such as insufficient amounts of data or dealing with an underdetermined system of equations.

Two-Dimensional Linear Model

To evaluate machine learning models, we often generate synthetic datasets where the underlying ground truth relationship is known. For a linear regression task, we can draw a design matrix $$\mathbf{X}$$ of features from a standard normal distribution. The corresponding labels $$\mathbf{y}$$ are computed by applying a ground truth linear function defined by true weights $$\mathbf{w}$$ and bias $$b$$, and then corrupting the output with additive noise $$\boldsymbol{\epsilon}$$ drawn from a normal distribution with mean $$\mu=0$$ and standard deviation $$\sigma = 0.01$$: $$\mathbf{y}= \mathbf{X} \mathbf{w} + b + \boldsymbol{\epsilon}$$. This procedure ensures that the generated labels simulate realistically observed data containing inherent random variation.

Claude

Linear regression is a linear approach to model a target prediction value based on independent variables.

Linear Regression

Dive into Deep Learning

Based on the number of independent variables used, there are two types of linear regression:
- Simple Linear Regression
- Multiple Linear Regression

Types of linear regression

Logistic regression models discrete (categorical) outcomes, for example a variable that is 'yes' or 'no' rather than a continuous value. It estimates the probability of each category and then applies a decision threshold to that probability to choose which category to output.
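
As a rough illustration of the probability-then-threshold step, the sketch below scores one example with a linear function, squashes the score with the sigmoid, and compares the result with a threshold. The function names, the parameter values, and the default threshold of `0.5` are assumptions made for this example.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(x, w, b, threshold=0.5):
    """Return (probability of 'yes', predicted label) for one example."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    p = sigmoid(score)
    return p, ("yes" if p >= threshold else "no")


# Hypothetical parameters, chosen only for illustration.
w, b = [1.5, -2.0], 0.25
p, label = predict_label([2.0, 0.5], w, b)
print(round(p, 3), label)
```

Raising or lowering `threshold` trades off the two kinds of misclassification without changing the fitted probabilities.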

Logistic Regression

The terms of a linear model can each be one of the following:

- A constant
- A parameter multiplied by an independent variable (IV)

You then build the equation only by adding such terms together. These rules limit the model to a single form:
Dependent variable = constant + parameter * IV + … + parameter * IV

Nonlinear regression models don’t fit this type.

Examples of nonlinear models include: 
	

When is a model considered linear?

This is a series of videos that introduces linear regression models and breaks down the process of linear regression.

[StatQuest with Josh Starmer]. (2020, February 21). Linear Regression and Linear Models [Video File]. Retrieved from [www.youtube.com/playlist?list=PLblh5JKOoLUIzaEkCLIUxQFjPIlapw8nU](https://www.youtube.com/playlist?list=PLblh5JKOoLUIzaEkCLIUxQFjPIlapw8nU) 

Linear Regression and Linear Models Videos

Here is an example of using R to do a simple linear regression. 
```R
# MASS provides the Boston housing data set, along with many other data sets and functions
library(MASS)

# ISLR provides the data sets associated with the book An Introduction to Statistical Learning
library(ISLR)

# Fit a simple linear regression with medv as the response and lstat as the predictor
lm.fit <- lm(medv ~ lstat, data = Boston)

# p-values and standard errors for the coefficients, plus the R^2 statistic and F-statistic
summary(lm.fit)

# Confidence intervals for the coefficient estimates
confint(lm.fit)

# Predicted medv at lstat = 5, 10, 15, with confidence intervals
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "confidence")

# The same predictions, with prediction intervals
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "prediction")

# Make the columns of Boston accessible by name
attach(Boston)

# Scatter plot of the data, with the fitted regression line drawn in red
plot(lstat, medv)
abline(lm.fit, lwd = 3, col = "red")
```
Expected output (plot and abline only):
![The expected output from the plot and abline commands](https://firebasestorage.googleapis.com/v0/b/onecademy-1.appspot.com/o/UploadedImages%2FWenfei-Tang_Wed%2C%2026%20Feb%202020%2020%3A51%3A58%20GMT.png?alt=media&token=5cf41866-c6fe-4732-88c5-2233fe889aa3) 

Example of Linear Regression Using R

We estimate the parameters $\beta_0, \beta_1, \beta_2, \dots, \beta_p$ such that we minimize the error in the following approximation:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon$$
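
For the special case of a single predictor ($p = 1$), the least-squares estimates have a closed form. The sketch below is one way to compute them with the standard library; the function name `fit_simple_ols` and the synthetic data (true intercept 4.2, true slope 2.0, noise standard deviation 0.01) are assumptions for illustration.

```python
import random


def fit_simple_ols(x, y):
    """Closed-form least-squares estimates (beta0, beta1) for y ≈ beta0 + beta1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sxy / sxx
    # Intercept: forces the fitted line through the point (x_bar, y_bar).
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(500)]
y = [4.2 + 2.0 * xi + rng.gauss(0.0, 0.01) for xi in x]
beta0, beta1 = fit_simple_ols(x, y)
print(beta0, beta1)  # close to 4.2 and 2.0
```

With several predictors the same criterion is minimized, but the solution requires solving a linear system rather than two scalar formulas.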

Training or Fitting a linear regression model

The quality of a linear regression fit, that is, the extent to which the model fits the data, can be assessed with:
1. $R^2$
2. Residual Standard Error (RSE)
3. Goodness of Fit
4. F-Statistic
5. Mean Squared Error (MSE)
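
Several of these quantities can be computed directly from the true targets and the fitted values. The sketch below assumes the predictions come from a least-squares fit with an intercept and `num_predictors` predictors (the $R^2$ and F-statistic formulas rely on that); the function name and the toy numbers are illustrative.

```python
import math


def fit_metrics(y_true, y_pred, num_predictors=1):
    """MSE, RSE, R^2 and F-statistic for a least-squares fit with an intercept."""
    n, p = len(y_true), num_predictors
    y_bar = sum(y_true) / n
    rss = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))  # residual sum of squares
    tss = sum((yt - y_bar) ** 2 for yt in y_true)                # total sum of squares
    return {
        "mse": rss / n,
        "rse": math.sqrt(rss / (n - p - 1)),
        "r2": 1.0 - rss / tss,
        "f_statistic": ((tss - rss) / p) / (rss / (n - p - 1)),
    }


# Toy data: x = 1..5; y_pred holds the least-squares fitted values 0.6 + 0.8 * x.
y_true = [1.0, 3.0, 2.0, 5.0, 4.0]
y_pred = [1.4, 2.2, 3.0, 3.8, 4.6]
metrics = fit_metrics(y_true, y_pred)
print(metrics)
```

For this toy fit, $R^2$ is 0.64 and the MSE is 0.72, meaning the line explains 64% of the variance in the targets.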

Assessing the accuracy of linear regression

Linear regression is an example of a parametric method because it assumes a linear functional form for $f(X)$. If the specified functional form is far from the truth and prediction accuracy is the goal, the parametric method will perform poorly because of its higher bias. On the other hand, linear regression is backed by well-established statistical theory.

KNN is an example of a nonparametric method that relies on feature similarity. It is useful because most real-world data does not obey the typical theoretical assumptions (e.g., Gaussian mixtures or linear separability). Nonparametric methods do not explicitly assume a parametric form for $f(X)$, and thereby provide an alternative and more flexible approach to regression. As a result, KNN has lower bias and is more robust in highly nonlinear settings. However, KNN regression is not as widely studied as linear regression.
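
To make the contrast concrete, here is a minimal sketch of KNN regression for one-dimensional inputs: the prediction is simply the mean target of the `k` closest training points, with no functional form assumed. The function name, the choice `k=3`, and the quadratic toy data are assumptions for illustration.

```python
def knn_regress(x_train, y_train, x_query, k=3):
    """Predict y at x_query as the mean target of the k nearest training points (1-D inputs)."""
    neighbors = sorted(zip(x_train, y_train), key=lambda pair: abs(pair[0] - x_query))[:k]
    return sum(y for _, y in neighbors) / len(neighbors)


x_train = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [xi ** 2 for xi in x_train]  # a nonlinear relationship: 0, 1, 4, 9, 16, 25
prediction = knn_regress(x_train, y_train, 2.1, k=3)
print(prediction)  # mean of the targets at x = 2, 3, 1
```

A straight line fitted to the same data would miss the curvature everywhere, while KNN follows it locally at the cost of noisier estimates when data are sparse.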

Comparison of Linear Regression with K-Nearest Neighbors

- Linearity of the response-predictor relationship
- Normal distribution of residuals
- Error terms $\epsilon_1, \epsilon_2, \dots, \epsilon_n$ are uncorrelated
- Error terms have constant variance (homoscedasticity)
- Little or no multicollinearity between the features
- Large sample size and few or no outliers

Assumptions of Linear Regression

The regression coefficient $$\beta_1$$ in the regression equation $$y \approx \beta_0 + \beta_1 X$$ is an estimate of the rate of change (slope) of Y with respect to X.

Regression Coefficient

Linear regression is easier to implement, interpret, and draw inferences from. However, one can rarely find a perfectly linear relationship in the real world. Nonlinear regression is more difficult to implement, interpret, and draw inferences from, but it can offer more predictive power by relaxing the assumption that the model is linear.

Linear regression vs. nonlinear regression

- L1 Regularization
- L2 Regularization
- Dropout Regularization
- Data Augmentation
- Early Stopping
- Tangent Distance
- Tangent Prop
- Manifold Tangent Classifier

Popular Regularization Techniques in Deep Learning

We can use a polynomial feature transformation to map a problem into a higher-dimensional regression space. Adding these extra polynomial features gives us a much richer set of functions with which to fit the data. This is like allowing polynomials, rather than only a straight line, to be fit to the training data, while still using the same least-squares criterion that minimizes mean squared error. Adding new features such as polynomial features is also very effective for classification.
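
The sketch below illustrates the idea for a scalar input: each `x` is expanded into `[x, x**2]`, and an ordinary least-squares fit on the expanded features recovers a quadratic. It is a dependency-free illustration under stated assumptions, not a recommended implementation: the helper names are hypothetical, the data are a noiseless quadratic, and the normal equations are solved with a small Gaussian elimination routine.

```python
def polynomial_features(x, degree):
    """Map a scalar x to [x, x**2, ..., x**degree]."""
    return [x ** d for d in range(1, degree + 1)]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z


def fit_least_squares(features, y):
    """Least squares via the normal equations; returns [bias, w_1, ..., w_d]."""
    rows = [[1.0] + f for f in features]  # prepend a constant column for the bias
    d = len(rows[0])
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    Aty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(d)]
    return solve(AtA, Aty)


xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1.0 + 2.0 * x + 3.0 * x ** 2 for x in xs]  # noiseless quadratic, for illustration
coeffs = fit_least_squares([polynomial_features(x, 2) for x in xs], ys)
print(coeffs)  # approximately [1, 2, 3]
```

The model stays linear in its parameters; only the inputs have been transformed, which is why the same least-squares machinery applies.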

Polynomial Feature Transformation

To address the under-fitting or over-fitting problems of ordinary linear regression models, locally weighted linear regression introduces weights into the loss function.
Its loss becomes:
$J(\theta)=\sum_{i=1}^m w^{(i)} (y^{(i)} - \theta^T x^{(i)})^2$
Here $w^{(i)}$ is the weight of the $i$-th training example. For a query point $x$, it is calculated with a Gaussian kernel of bandwidth $k$:
$w^{(i)} = \exp\left(-\frac{(x - x^{(i)})^2}{2k^2}\right)$
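
A small sketch may help show how the weights enter the fit. For one-dimensional inputs with an intercept, minimizing the weighted loss at a query point reduces to a weighted version of the usual slope and intercept formulas. The function names, the bandwidth `k=0.2`, and the sine-wave toy data are assumptions for this example.

```python
import math


def gaussian_weight(x_i, x_query, k):
    """Gaussian kernel weight of training point x_i relative to the query point."""
    return math.exp(-((x_i - x_query) ** 2) / (2.0 * k ** 2))


def lwlr_predict(x_train, y_train, x_query, k=0.5):
    """Fit a weighted line around x_query and evaluate it there (1-D inputs)."""
    w = [gaussian_weight(xi, x_query, k) for xi in x_train]
    w_sum = sum(w)
    # Weighted means replace the ordinary means of simple linear regression.
    x_bar = sum(wi * xi for wi, xi in zip(w, x_train)) / w_sum
    y_bar = sum(wi * yi for wi, yi in zip(w, y_train)) / w_sum
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x_train, y_train))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x_train))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept + slope * x_query


x_train = [0.1 * i for i in range(61)]       # 0.0, 0.1, ..., 6.0
y_train = [math.sin(xi) for xi in x_train]   # a clearly nonlinear target
prediction = lwlr_predict(x_train, y_train, 1.0, k=0.2)
print(prediction, math.sin(1.0))
```

A smaller `k` makes the fit more local (more flexible, higher variance), while a larger `k` approaches ordinary linear regression.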

Locally Weighted Linear Regression

In the context of linear regression, specific mathematical notation is used to systematically describe the dataset. Typically, $$n$$ denotes the total number of examples. Superscripts are utilized to enumerate individual samples and their corresponding targets, while subscripts are used to index their specific coordinates or features. For instance, $$\mathbf{x}^{(i)}$$ represents the complete $$i^{\textrm{th}}$$ sample, and $$x_j^{(i)}$$ designates the $$j^{\textrm{th}}$$ coordinate of that $$i^{\textrm{th}}$$ sample.
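
The notation maps naturally onto a nested list, with one caveat: the text's indices start at 1, while Python's start at 0. The toy values and the helper names `sample` and `coordinate` below are hypothetical, introduced only to illustrate the indexing.

```python
# Hypothetical toy dataset: n = 3 examples, each with d = 2 features.
X = [
    [0.5, 1.2],   # x^(1)
    [1.5, -0.3],  # x^(2)
    [2.0, 0.8],   # x^(3)
]
y = [1.0, 2.5, 3.1]  # y^(1), y^(2), y^(3)

n = len(X)  # total number of examples


def sample(i):
    """Return x^(i), using the text's 1-based superscript."""
    return X[i - 1]


def coordinate(i, j):
    """Return x_j^(i): the j-th coordinate of the i-th sample (both 1-based)."""
    return X[i - 1][j - 1]


print(n, sample(2), coordinate(2, 1))  # 3 [1.5, -0.3] 1.5
```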

Linear Regression Dataset Notation

The foundational concepts of linear regression date back to the early 19th century, with significant contributions from mathematicians Carl Friedrich Gauss in 1809 and Adrien-Marie Legendre in 1805. Despite its historical age, linear regression remains one of the simplest and most popular standard tools for tackling modern regression problems.

History of Linear Regression

In the simplest case of one-dimensional inputs, a linear regression model attempts to fit a straight line through the observed data points. The fitness of this line depends purely on the choice of model parameters (the weight and the bias), which are adjusted to minimize the distance between the line's predictions and the true target values.

Linear Regression One-Dimensional Fit

The squared error is the standard loss function for regression problems, used to quantify the discrepancy between a single predicted value $$\hat{y}^{(i)}$$ and its true corresponding label $$y^{(i)}$$. The loss for a single example is defined mathematically as $$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2$$. The quadratic form heavily penalizes large differences, and the constant fraction $$\frac{1}{2}$$ is conventionally added because it cleanly cancels out when taking the derivative during optimization.
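
The cancellation of the $$\frac{1}{2}$$ can be checked numerically. In the sketch below, the analytic derivative of the loss with respect to the prediction is simply $$\hat{y} - y$$, and a central finite difference agrees with it. The function names and the sample values are assumptions for illustration.

```python
def squared_loss(y_hat, y):
    """Squared error for one example: 0.5 * (y_hat - y)^2."""
    return 0.5 * (y_hat - y) ** 2


def squared_loss_grad(y_hat, y):
    """Derivative with respect to y_hat; the 1/2 cancels the 2 from the power rule."""
    return y_hat - y


y_hat, y = 3.0, 1.0
eps = 1e-6
# Central finite difference as an independent check of the analytic derivative.
numeric = (squared_loss(y_hat + eps, y) - squared_loss(y_hat - eps, y)) / (2 * eps)
print(squared_loss(y_hat, y), squared_loss_grad(y_hat, y), numeric)
```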

Squared Error Loss

A fundamental assumption of linear regression is that the relationship between the features $$\mathbf{x}$$ and the target $$y$$ is approximately linear. This means the expected value of the target, defined as the conditional mean $$E[Y \mid X=\mathbf{x}]$$, can be mathematically expressed as a weighted sum of the input features $$\mathbf{x}$$. This framework allows the actual target values to deviate from their expected conditional mean due to observation noise.

Linear Regression Conditional Mean Assumption

In a multiple-feature linear regression model, weights are parameters that determine the influence of each individual feature on the target prediction. When the model is expressed as a weighted sum, such as $$\hat{y} = w_1 x_1 + \dots + w_d x_d + b$$, each weight quantifies the specific contribution of its corresponding feature. In compact linear algebra notation, these parameters are collected into a weight vector $$\mathbf{w} \in \mathbb{R}^d$$.
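
As a minimal sketch of the weighted sum, the function below computes $$\hat{y}$$ for one example and shows that increasing a single feature by one unit changes the prediction by exactly that feature's weight. The function name and the parameter values are assumptions chosen for illustration.

```python
def predict(x, w, b):
    """Weighted sum of the features plus the bias: w_1 x_1 + ... + w_d x_d + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


# Hypothetical parameters and input, for illustration only.
w, b = [2.0, -3.4], 4.2
x = [1.0, 2.0]
y_hat = predict(x, w, b)

# Increase only the first feature by 1: the prediction moves by w_1.
delta = predict([x[0] + 1.0, x[1]], w, b) - y_hat
print(y_hat, delta)
```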

Linear Regression Weight Parameters

In a linear regression model, the bias (also referred to as an offset or intercept) is a parameter that determines the baseline value of the model's estimate when all input features are exactly zero. The bias term $$b$$ is necessary because it lets the model express any affine function of the features, rather than restricting predictions to lines (or hyperplanes) that must pass directly through the origin.

Linear Regression Bias Parameter

Linear regression can be conceptualized as a single-layer, fully connected neural network. In this representation, each given input feature $$x_1, \ldots, x_d$$ corresponds to an input neuron. Since the goal is to predict a single numerical value, all input neurons are directly connected to a single computed output neuron $$o_1$$. The total number of inputs $$d$$ is referred to as the feature dimensionality in the input layer.

Linear Regression as a Neural Network

Synthetic Data Generation for Linear Regression

Before attempting to use complex machine learning architectures for a regression task like predicting house prices, it is best practice to first train a simple linear model with squared loss. This linear model serves two critical functions: first, it provides a sanity check to verify that the dataset contains meaningful information (i.e., that the model can perform better than random guessing), and second, it establishes a performance baseline, giving researchers an intuition for how much additional gain can be expected from more sophisticated models.
