Concept

Regularization Constant

The trade-off between the standard prediction loss and the additive weight decay penalty is characterized by the regularization constant $\lambda$. This nonnegative hyperparameter is fit using validation data and modifies the objective to $L(\mathbf{w}, b) + \frac{\lambda}{2} \|\mathbf{w}\|^2$. When $\lambda = 0$, the original loss function is recovered. For $\lambda > 0$, the size of the weights is restricted, with larger values of $\lambda$ constraining the weights more strongly. The penalty term is divided by $2$ by convention so that the constant cancels gracefully when the derivative of the quadratic penalty is taken, leaving a gradient contribution of simply $\lambda \mathbf{w}$.
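As a minimal sketch of the penalized objective (assuming a squared-error loss and NumPy; the function and argument names here are illustrative, not from the text), the computation might look like:

```python
import numpy as np

def penalized_loss(w, b, X, y, lam):
    """Squared-error loss plus the weight decay penalty (lam / 2) * ||w||^2."""
    residual = X @ w + b - y
    data_loss = 0.5 * np.mean(residual ** 2)
    # By convention the bias b is not penalized; only the weights shrink.
    penalty = (lam / 2) * np.sum(w ** 2)
    return data_loss + penalty
```

Setting `lam = 0` recovers the plain loss, and the `lam / 2` factor means the gradient of the penalty with respect to `w` is just `lam * w`, matching the convention described above.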


Updated 2026-05-03

Tags

Data Science

D2L

Dive into Deep Learning @ D2L