Theory

Variance of Gradients in Backpropagation

During backpropagation through a fully connected layer without nonlinearities, the network faces a variance scaling problem analogous to the one in forward propagation: gradients propagating backward from layers closer to the output can blow up or vanish exponentially. The gradient with respect to each input is a sum of $n_\textrm{out}$ terms, each the product of a weight and an output gradient, so its variance is multiplied by $n_\textrm{out}\sigma^2$ at every layer. Applying the same statistical reasoning used for the forward pass, keeping the variance of these gradients fixed requires the weight variance $\sigma^2$ to satisfy $n_\textrm{out}\sigma^2 = 1$, where $n_\textrm{out}$ is the number of outputs of that layer.
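The following minimal NumPy sketch (not from the book; layer width and depth are arbitrary illustrative choices) checks this condition numerically: a gradient is pushed backward through a stack of linear layers, and its variance stays near 1 only when the weights are drawn with variance $1/n_\textrm{out}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in = n_out = 512     # illustrative layer width (square layers so depth can vary)
num_layers = 50        # illustrative depth

def backprop_grad_variance(sigma2):
    """Variance of the gradient after backpropagating through num_layers
    fully connected layers (no nonlinearity) with weight variance sigma2."""
    grad = rng.standard_normal(n_out)      # gradient arriving from the layer above
    for _ in range(num_layers):
        W = rng.normal(0.0, np.sqrt(sigma2), size=(n_out, n_in))
        grad = W.T @ grad                  # backward pass of y = W x gives dL/dx = W^T dL/dy
    return grad.var()

print(backprop_grad_variance(1.0 / n_out))  # roughly 1: variance preserved since n_out * sigma^2 = 1
print(backprop_grad_variance(2.0 / n_out))  # grows roughly like 2^num_layers: gradients blow up
```

Choosing a smaller variance, e.g. $0.5/n_\textrm{out}$, shows the opposite failure: the gradient variance decays toward zero with depth, which is the vanishing-gradient side of the same scaling problem.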
