Learn Before
Formula

Linear Regression Design Matrix Notation

When predicting labels for an entire dataset consisting of $n$ examples, it is mathematically convenient to organize the features into a design matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$, where each row represents an example and each column a feature. The vector of predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$ for all examples can then be concisely computed as a matrix-vector product with the weight vector $\mathbf{w}$, adding the bias term $b$ via broadcasting: $\hat{\mathbf{y}} = \mathbf{X} \mathbf{w} + b$.
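A minimal NumPy sketch of this formula, with hypothetical sizes $n = 4$ and $d = 3$ (the data here is random, purely for illustration):

```python
import numpy as np

n, d = 4, 3
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d))   # design matrix: one row per example
w = rng.normal(size=(d,))     # weight vector
b = 1.5                       # scalar bias

# Predictions for all n examples at once; the scalar b is
# broadcast across the n entries of X @ w.
y_hat = X @ w + b             # shape (n,)

# Equivalent to computing each prediction individually.
y_loop = np.array([X[i] @ w + b for i in range(n)])
assert np.allclose(y_hat, y_loop)
```

The vectorized form avoids an explicit Python loop over examples, which is both more concise and substantially faster for large $n$.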

Updated 2026-05-02


Tags

D2L

Dive into Deep Learning @ D2L