With the inverted dropout technique, at test time:

Inverted dropout question

A dropout layer can be implemented algorithmically from scratch by first checking if the dropout probability $$p$$ equals $$1$$; if so, it immediately returns a tensor of zeros. For $$p < 1$$, it generates a boolean mask by evaluating where a tensor of uniform random numbers $$U[0, 1]$$ is strictly greater than $$p$$. This boolean mask is cast to a floating-point format, multiplied element-wise with the input tensor, and divided by $$1 - p$$ to rescale the surviving elements.
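The steps above can be sketched in a few lines of Python. This is only an illustrative sketch: it works on plain lists of floats rather than tensors, and the name `dropout_layer` and the optional `rng` parameter are assumptions made for the example, not part of any particular library.

```python
import random

def dropout_layer(x, p, rng=random):
    """Inverted dropout on a list of floats; each element is dropped with probability p."""
    assert 0 <= p <= 1
    # Special case from the description: p == 1 drops everything.
    if p == 1:
        return [0.0 for _ in x]
    # Mask entry is 1.0 where a uniform draw is strictly greater than p, else 0.0.
    mask = [float(rng.random() > p) for _ in x]
    # Element-wise multiply by the mask, then rescale survivors by 1 / (1 - p).
    return [m * v / (1.0 - p) for m, v in zip(mask, x)]
```

With `p = 0.5`, every output element is either zero or twice its input value. A tensor library would normally perform the same three steps (mask, multiply, rescale) in vectorized form.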

Algorithm for a Dropout Layer

In modern implementations of dropout (often called inverted dropout), the activation tensor is not only multiplied by a binary mask but the remaining values are also rescaled. If elements are dropped out with probability $$p$$, the surviving elements are divided by $$1 - p$$. This rescaling step, performed during training, preserves the expected value of the activations and eliminates the need for scaling adjustments during the test phase.
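As a short check of the expected-value claim, write $$h$$ for a single activation and $$h'$$ for its value after inverted dropout with $$p < 1$$ (both symbols are introduced here only for illustration):

$$
h' =
\begin{cases}
0 & \text{with probability } p \\
\dfrac{h}{1 - p} & \text{with probability } 1 - p
\end{cases}
\qquad\Longrightarrow\qquad
E[h'] = p \cdot 0 + (1 - p) \cdot \frac{h}{1 - p} = h
$$

The division by $$1 - p$$ is exactly what cancels the probability of survival, so the expectation matches the undropped activation.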

University of Michigan - Ann Arbor

Claude

Unlike L2 or L1 regularization, which add a penalty term to the cost, dropout regularization works by assigning a probability P that a given node is turned off for the current training iteration. This makes the model simpler during training, which reduces overfitting, while still retaining the benefits of the larger model at test time. Because no single node can have excessive influence, the model spreads out the weights, giving an effect similar to L2 regularization. Dropout is a widely used form of regularization: it achieves results similar to L2 regularization and also adds some robustness, since a different random set of nodes is dropped on each iteration.

Dropout Regularization in Deep Learning

https://jamesmccaffrey.wordpress.com/2019/05/07/neural-network-dropout-and-inverted-dropout/

Neural Network Dropout and Inverted Dropout

Dive into Deep Learning

Dropout is successful because:
- With each iteration, you train a smaller network.
- Dropout tends to shrink the squared norm of the weights.
- Shrinking the weights helps prevent overfitting.

Why does dropout work?

Inverted Dropout Technique

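  - The cost function is no longer well-defined: a different random set of nodes is dropped on each iteration, so it is hard to check that the cost decreases steadily during training.
  - Training can become considerably more expensive, and in a complete system the added computational cost can outweigh dropout's usual benefits.
  - Dropout is also less effective when very few labeled training examples are available.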

Disadvantages of Dropout

Here P denotes the keep probability, that is, the probability that a unit is retained rather than dropped. As P gets larger, fewer units are dropped, so the regularization weakens and the training error tends to be lower. You will often set P = 1 for layers that are less prone to overfitting, which keeps every unit in that layer.

Setting Probability P for Dropout Regularization in Deep Learning

When dropout is applied to a hidden layer of a neural network, each hidden unit is zeroed out with a specified probability $$p$$. This effectively creates a sub-network containing only a subset of the original neurons. Consequently, the forward calculation of the outputs and the backward calculation of gradients during backpropagation no longer depend on the dropped nodes. This mechanism ensures that the output layer cannot become overly dependent on any single hidden unit.
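A small sketch can make the gradient claim concrete. It assumes a hypothetical setup in which the hidden activations `h` feed a single linear output with weights `w`. The function name and the hand-written gradients are illustrative only, and the surviving units are rescaled by $$1 / (1 - p)$$ as in inverted dropout.

```python
import random

def dropout_forward_backward(h, w, p, rng):
    """Pass hidden activations h through dropout into one linear output (assumes p < 1)."""
    scale = 1.0 / (1.0 - p)
    # 1.0 keeps a hidden unit, 0.0 zeroes it out for this iteration.
    mask = [1.0 if rng.random() > p else 0.0 for _ in h]
    h_drop = [m * v * scale for m, v in zip(mask, h)]
    # Forward pass: only the surviving units contribute to the output.
    out = sum(wi * hi for wi, hi in zip(w, h_drop))
    # Backward pass: gradients of the output with respect to h and w.
    grad_h = [wi * m * scale for wi, m in zip(w, mask)]
    grad_w = list(h_drop)
    return out, mask, grad_h, grad_w
```

For every position where `mask` is zero, both `grad_h` and `grad_w` are zero as well, so the dropped units take no part in that iteration's update.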

Mechanics of Dropout on a Hidden Layer

Dropout is typically disabled at test time. When evaluating a trained model on a new example, no nodes are dropped out, so the full capacity of the network is used. Because inverted dropout already rescales the surviving activations during training, the outputs need no further normalization at test time.
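The train/test distinction is often expressed with a flag. The sketch below assumes a hypothetical `training` argument (real libraries expose this differently) and again uses plain Python lists for illustration.

```python
import random

def dropout(x, p, training, rng=random):
    """Inverted dropout that is active only while training."""
    # At test time (or with p == 0) the input passes through unchanged.
    if not training or p == 0:
        return list(x)
    if p == 1:
        return [0.0 for _ in x]
    # During training, keep each element with probability 1 - p and rescale it.
    return [v / (1.0 - p) if rng.random() > p else 0.0 for v in x]
```

Averaged over many training passes, the output of `dropout(x, p, training=True)` is close to `x`, which is why the unscaled test-time output is consistent with what later layers saw during training.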
