To implement dropout for a layer computationally, we must draw samples from a Bernoulli distribution for each dimension, where a node is kept (value $$1$$) with probability $$1-p$$ and dropped (value $$0$$) with probability $$p$$. An efficient way to achieve this is to generate a tensor of samples from a uniform distribution $$U[0, 1]$$ and apply a threshold, keeping only those elements where the uniformly sampled value is strictly greater than the dropout probability $$p$$.
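
The thresholding step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `dropout_mask`, the tensor shape, and the fixed seed are assumptions made for the example.

```python
import numpy as np

def dropout_mask(shape, p, rng):
    """Return a 0/1 mask whose entries are 1 (keep) with probability 1 - p."""
    # Draw uniform samples in [0, 1); an element is kept when its sample exceeds p.
    u = rng.uniform(0.0, 1.0, size=shape)
    return (u > p).astype(np.float64)

rng = np.random.default_rng(0)  # fixed seed, used here only for reproducibility
mask = dropout_mask((1000, 100), p=0.3, rng=rng)
keep_fraction = mask.mean()     # empirically close to 1 - p = 0.7
```

Comparing against a single uniform tensor produces the whole Bernoulli mask in one vectorized operation, which is why this approach is cheaper than sampling each unit separately.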

Implementing Dropout via Uniform Sampling

When implementing dropout in a neural network, the dropout operation is typically applied to the output of each hidden layer immediately following its non-linear activation function. This ensures that the neurons randomly zeroed out are the activated representations of the layer's output.
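
A hedged sketch of this ordering for a single hidden layer follows. The helper names (`relu`, `dropout`, `hidden_layer`), the layer sizes, and the choice of ReLU as the activation are assumptions for illustration; the `dropout` helper also rescales surviving values by $$1/(1-p)$$, following the inverted dropout convention described elsewhere on this page.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def dropout(h, p, rng):
    """Zero each element of h with probability p; rescale survivors by 1 / (1 - p)."""
    assert 0.0 <= p <= 1.0
    if p == 0.0:
        return h
    if p == 1.0:
        return np.zeros_like(h)
    mask = rng.uniform(size=h.shape) > p
    return mask * h / (1.0 - p)

def hidden_layer(x, W, b, p, rng):
    # Order of operations: affine transform -> non-linear activation -> dropout.
    a = relu(x @ W + b)
    return dropout(a, p, rng)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # batch of 4 examples with 8 features
W = rng.normal(size=(8, 16)) * 0.1   # hypothetical layer with 16 hidden units
b = np.zeros(16)
h = hidden_layer(x, W, b, p=0.5, rng=rng)
```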

Applying Dropout After Activation

In a neural network architecture, dropout probabilities do not have to be uniform; they can be configured independently for each hidden layer. A common heuristic is to set a lower dropout probability for the layers closer to the input, as these initial layers often capture fundamental, low-level features that are critical for the network's overall performance.
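
One way to express per-layer probabilities is to pass a list with one entry per hidden layer. The sketch below assumes a small ReLU network; the names `mlp_forward` and `drop_probs`, the layer widths, and the values 0.2 and 0.5 are illustrative choices, not recommendations taken from the text.

```python
import numpy as np

def dropout(h, p, rng):
    """Zero each element of h with probability p; rescale survivors by 1 / (1 - p)."""
    if p == 0.0:
        return h
    if p == 1.0:
        return np.zeros_like(h)
    mask = rng.uniform(size=h.shape) > p
    return mask * h / (1.0 - p)

def mlp_forward(x, params, drop_probs, rng):
    """Forward pass through the hidden layers, each with its own dropout probability."""
    h = x
    for (W, b), p in zip(params, drop_probs):
        h = np.maximum(h @ W + b, 0.0)  # ReLU activation
        h = dropout(h, p, rng)          # layer-specific dropout probability
    return h

rng = np.random.default_rng(0)
sizes = [20, 64, 64]                    # hypothetical input width and two hidden widths
params = [(rng.normal(size=(m, n)) * 0.1, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]
drop_probs = [0.2, 0.5]                 # lower near the input, higher deeper in
x = rng.normal(size=(8, 20))
h = mlp_forward(x, params, drop_probs, rng)
```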

Layer-Specific Dropout Probabilities

When dropout is applied to a hidden layer of a neural network, each hidden unit is zeroed out with a specified probability $$p$$. This effectively creates a sub-network containing only a subset of the original neurons. Consequently, the forward calculation of the outputs and the backward calculation of gradients during backpropagation no longer depend on the dropped nodes. This mechanism ensures that the output layer cannot become overly dependent on any single hidden unit.
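
The independence from dropped nodes can be checked numerically. The sketch below assumes a linear output layer and the toy loss `out.sum()`, both chosen only to keep the backward pass short; all variable names are hypothetical. Surviving activations are rescaled by $$1/(1-p)$$, which does not affect which gradients are zero.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
h = rng.normal(size=(3, 6))              # hypothetical hidden activations
W_out = rng.normal(size=(6, 2))          # hypothetical output-layer weights

# Forward pass: dropped units contribute exactly 0 to the output.
mask = (rng.uniform(size=h.shape) > p).astype(h.dtype)
h_drop = mask * h / (1.0 - p)
out = h_drop @ W_out

# Backward pass for the scalar loss L = out.sum().
grad_out = np.ones_like(out)             # dL/d(out)
grad_h_drop = grad_out @ W_out.T         # dL/d(h_drop)
grad_h = grad_h_drop * mask / (1.0 - p)  # chain rule through the mask

dropped = mask == 0                      # positions of the dropped units
```

Every entry of `grad_h` at a dropped position is zero, so those units receive no gradient on this iteration, and changing their activations leaves `out` unchanged.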

Claude

Unlike L2 or L1 regularization, which add a penalty term to the cost function, dropout regularization works by assigning a probability P that a given node is turned off for the current training iteration. Dropping nodes makes the model effectively simpler during training, which reduces overfitting, while the full, larger model is still available at test time. Because no single node can have excess influence on the model, the weights are spread out, giving an effect similar to L2 regularization. Dropout is often a preferred form of regularization for deep networks: it achieves results similar to L2 regularization and also adds some robustness, since a different random subset of nodes is dropped on each iteration.

Dropout Regularization in Deep Learning

Dive into Deep Learning

https://jamesmccaffrey.wordpress.com/2019/05/07/neural-network-dropout-and-inverted-dropout/

Neural Network Dropout and Inverted Dropout

In modern implementations of dropout (often called inverted dropout), the activation tensor is not only multiplied by a binary mask but the remaining values are also rescaled. If elements are dropped out with probability $$p$$, the surviving elements are divided by $$1 - p$$. This rescaling step, performed during training, preserves the expected value of the activations and eliminates the need for scaling adjustments during the test phase.
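
A short check of why dividing by $$1 - p$$ preserves the expected value, assuming $$0 \le p < 1$$ and writing $$h$$ for an activation and $$h'$$ for its value after dropout (both symbols are introduced here only for illustration):

$$
h' =
\begin{cases}
0 & \text{with probability } p \\
\dfrac{h}{1-p} & \text{with probability } 1-p
\end{cases}
\qquad\Longrightarrow\qquad
E[h'] = p \cdot 0 + (1-p) \cdot \frac{h}{1-p} = h
$$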

Inverted Dropout Technique

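  - The cost function is not well-defined during training, because a different random subset of units is dropped on each iteration; this makes it harder to check that the cost is decreasing.
  - In a complete system the cost of using dropout can be significant, since the reduced effective capacity typically has to be offset by a larger model and more training iterations, to the point where the computational cost can outweigh the regularization benefit.
  - Dropout is also less effective when extremely few labeled training examples are available.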

Disadvantages of Dropout

Here P denotes the keep probability, that is, the probability that a unit is retained (the complement of the drop probability used elsewhere on this page). As P gets larger, fewer units are dropped, so the regularization weakens and the training error decreases. P is often set to 1 for layers that are not prone to overfitting, which keeps every unit in that layer.

Setting Probability P for Dropout Regularization in Deep Learning

Mechanics of Dropout on a Hidden Layer

Typically, dropout is disabled at test time. When evaluating a trained model on a new example, no nodes are dropped out. The full capacity of the network is therefore used, and, because inverted dropout already rescales the activations during training, there is no need to normalize the outputs to account for missing activations.
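
A common way to express this switch is a flag that turns the dropout function into the identity at evaluation time. The sketch below assumes inverted dropout during training; the `training` parameter name and the toy input are illustrative choices.

```python
import numpy as np

def dropout(h, p, rng, training):
    """Inverted dropout that is active only while training."""
    if not training or p == 0.0:
        return h                      # test time: identity, nothing dropped or rescaled
    if p == 1.0:
        return np.zeros_like(h)
    mask = rng.uniform(size=h.shape) > p
    return mask * h / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones((2, 5))
train_out = dropout(h, 0.5, rng, training=True)   # each entry is either 0.0 or 2.0
test_out = dropout(h, 0.5, rng, training=False)   # identical to h
```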

Disabling Dropout at Test Time

Dropout regularization reduces overfitting for two related reasons. First, because a random subset of units is dropped on each training iteration, the network is effectively forced to train using a smaller, thinner sub-network at every step, which limits the capacity available to memorize noise in the training data. Second, dropout tends to shrink the squared norm of the weights, producing a regularizing effect similar to L2 regularization; because no single unit can rely on any other specific unit being present, weight magnitude and influence are spread out across the network rather than concentrated on a few features, which further prevents overfitting.
