Appropriate Regularization/Representation
The no free lunch theorem proves that no single regularization strategy is certifiably the best at every task, but that is not to say that every strategy is created equal. There may be no best general strategy, yet some strategies are more often useful than others. With that in mind, here is a list of regularization assumptions/strategies that have been shown to be generally useful (the sparsity assumption, for example, is sketched in code after this list):
- Smoothness
- Linearity
- Multiple explanatory factors
- Causal factors
- Depth/ hierarchical factors
- Shared factors across tasks
- Manifolds
- Natural clustering
- Temporal and spatial coherence
- Sparsity
- Simplicity of factor dependencies
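To make one of these assumptions concrete, the sparsity assumption is commonly encoded as an L1 penalty on a model's weights, which drives most weights toward exactly zero. Below is a minimal sketch in plain NumPy, using hypothetical synthetic data and illustrative hyperparameters, of lasso-style linear regression trained by subgradient descent:

```python
import numpy as np

# Minimal sketch: encoding the sparsity assumption as an L1 penalty
# on the weights of a linear model, trained by subgradient descent.
# The data, learning rate, and penalty strength are illustrative.

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:3] = [2.0, -1.5, 0.5]      # only 3 of the 20 features matter
y = X @ true_w + 0.1 * rng.normal(size=n)

w = np.zeros(d)
lr, lam = 0.01, 0.1                # learning rate, L1 strength
for _ in range(1000):
    grad = X.T @ (X @ w - y) / n   # gradient of mean squared error
    grad += lam * np.sign(w)       # subgradient of the L1 penalty
    w -= lr * grad

print(np.round(w, 2))              # most weights are driven near zero
```

With `lam = 0` the model tends to spread small nonzero weights across all features; the L1 term expresses the prior belief that only a few explanatory factors are relevant.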