Dropout Regularization in Deep Learning
Unlike L1 or L2 regularization, dropout regularization works by assigning each node a probability p of being turned off (set to zero) for the current training iteration. This makes the effective model simpler during training, which reduces overfitting, while the full, larger network is still used at test time. Because any node may be dropped on any iteration, no single node can gain excess influence on the output; the model instead spreads weight across many nodes, giving an effect similar to L2 regularization. Dropout is often a preferred form of regularization: it achieves results similar to L2 regularization while also adding robustness, since each iteration trains a different randomized subnetwork.
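As a concrete illustration, here is a minimal sketch of "inverted" dropout in NumPy (the function name dropout_forward and the 20% drop rate are illustrative assumptions, not from the article):

```python
import numpy as np

def dropout_forward(activations, p_drop, training=True, rng=None):
    """Apply inverted dropout to a layer's activations.

    p_drop is the probability that each node is turned off.
    Surviving activations are scaled by 1 / (1 - p_drop) during
    training so no rescaling is needed at test time.
    """
    if not training or p_drop == 0.0:
        return activations  # the full network is used at test time
    rng = rng or np.random.default_rng()
    keep_prob = 1.0 - p_drop
    # Bernoulli mask: 1 keeps a node, 0 drops it for this iteration
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

# Example: drop 20% of the nodes in a hidden layer during training
hidden = np.random.default_rng(0).standard_normal((4, 8))
out = dropout_forward(hidden, p_drop=0.2, training=True)
```

The scaling by 1 / (1 - p_drop) is what makes this the inverted variant: the expected value of each activation stays the same between training and testing, so the test-time forward pass can simply skip the dropout step.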