Learn Before
  • Popular Regularization Techniques in Deep Learning

L2 Regularization (Weight Decay) in Deep Learning

L2 regularization, also known as weight decay, adds a penalty term to the cost function equal to the sum of the squared weight values, scaled by $\frac{\lambda}{2m}$. It is called weight decay because it penalizes the growth of the weights while the cost function is being minimized. From a Bayesian perspective, L2 regularization corresponds to a Gaussian prior distribution on the weights.

$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{i}, y^{i}) + \frac{\lambda}{2m} \|w\|_2^2$$
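One way to see why this is called "decay" (a sketch, assuming a plain gradient-descent update with learning rate $\alpha$, where $dw^{[l]}$ denotes the gradient of the unregularized loss): the penalty contributes $\frac{\lambda}{m} w^{[l]}$ to the gradient, so every step first shrinks the weights by a constant factor and then applies the data gradient.

$$w^{[l]} := w^{[l]} - \alpha \left( dw^{[l]} + \frac{\lambda}{m} w^{[l]} \right) = \left( 1 - \frac{\alpha \lambda}{m} \right) w^{[l]} - \alpha \, dw^{[l]}$$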

$$\|w^{[l]}\|^2 = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} \left(w_{i,j}^{[l]}\right)^2$$

The rows $i$ of the weight matrix correspond to the neurons in the current layer $n^{[l]}$, whereas the columns $j$ correspond to the neurons in the previous layer $n^{[l-1]}$.
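As a concrete illustration, here is a minimal NumPy sketch of the regularized cost above; the function name, weight shapes, and numbers are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def l2_regularized_cost(data_cost, weights, lam, m):
    """Add the L2 (weight-decay) penalty to an unregularized cost.

    data_cost : float, the data term (1/m) * sum of per-example losses
    weights   : list of per-layer weight matrices w[l]
    lam       : regularization rate lambda
    m         : number of training examples
    """
    # Squared norm of each layer's weight matrix, summed over all layers
    l2_penalty = sum(np.sum(np.square(w)) for w in weights)
    return data_cost + (lam / (2 * m)) * l2_penalty

# Hypothetical two-layer network: each weight matrix has shape (n[l], n[l-1])
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((1, 4))]
print(l2_regularized_cost(data_cost=0.35, weights=weights, lam=0.7, m=100))
```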

Tags

Data Science

Related
  • Data Augmentation in Deep Learning

  • Early Stopping in Deep Learning

  • Dropout Regularization in Deep Learning

  • L2 Regularization (Weight Decay) in Deep Learning

  • Which of these techniques are useful for reducing variance (reducing overfitting)?

  • L1 Regularization in Deep Learning

  • ElasticNet Regression

  • If your Neural Network model seems to have high variance, what of the following would be promising things to try?

  • Regularization in ML and DL

  • Bagging in Deep Learning

  • Dropout in Deep Learning

  • Normalization of Data

  • Tangent Distance Algorithm

  • Tangent Propagation Algorithm

  • Manifold Tangent Classifier

  • Boosting in Deep Learning

  • Appropriate Regularization/ Representation

Learn After
  • Frobenius and L2

  • Ridge Regression

  • What is weight decay?

  • λ: Regularization Rate in Deep Learning

  • Gaussian (Normal) Distribution