Formula

Xavier Initialization Condition

When initializing network weights, we face a dilemma: to keep variance fixed during forward propagation, we need $n_\textrm{in} \sigma^2 = 1$, but for backpropagation, we need $n_\textrm{out} \sigma^2 = 1$. It is generally impossible to satisfy both conditions simultaneously unless the number of inputs equals the number of outputs. As a practical compromise, we try to satisfy the average of the two conditions: $\frac{1}{2} (n_\textrm{in} + n_\textrm{out}) \sigma^2 = 1$. This simplifies to the target weight standard deviation $\sigma = \sqrt{\frac{2}{n_\textrm{in} + n_\textrm{out}}}$, which forms the mathematical condition for Xavier initialization.
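As a minimal sketch of how this condition translates into code, the snippet below draws a weight matrix from a Gaussian with the Xavier standard deviation and checks the empirical spread; the helper name `xavier_std` and the layer sizes are illustrative choices, not from the source.

```python
import numpy as np

def xavier_std(n_in, n_out):
    """Xavier/Glorot target standard deviation: sqrt(2 / (n_in + n_out))."""
    return np.sqrt(2.0 / (n_in + n_out))

# Hypothetical layer sizes, chosen only for illustration.
n_in, n_out = 256, 128
rng = np.random.default_rng(seed=0)

# Sample weights from a zero-mean Gaussian with the Xavier std.
W = rng.normal(loc=0.0, scale=xavier_std(n_in, n_out), size=(n_in, n_out))

print(W.std())  # close to sqrt(2 / 384) ~= 0.0722
```

A uniform variant works the same way: draw from $U(-a, a)$ with $a = \sqrt{\frac{6}{n_\textrm{in} + n_\textrm{out}}}$, which has the same variance $\frac{2}{n_\textrm{in} + n_\textrm{out}}$.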


Tags

D2L

Dive into Deep Learning @ D2L
