
Batch Norm Implementation in Deep Learning

In each layer, before we feed the input $z^{(i)}$ into the activation function, we first normalize it with Batch Norm. That is, we first calculate

$$\mu = \frac{1}{n}\sum_{i} z^{(i)}$$

$$\sigma^2 = \frac{1}{n}\sum_{i}\left(z^{(i)} - \mu\right)^2$$

$$z_{norm}^{(i)} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

$$\tilde{z}^{(i)} = \gamma\, z_{norm}^{(i)} + \beta$$

We introduce the learnable parameters $\gamma$ and $\beta$ because we don't want the inputs to neurons in hidden layers to always have mean 0 and variance 1.

$\Rightarrow a^{(i)} = g(\tilde{z}^{(i)})$, where $g$ is some activation function.

Recall that in each layer, $z^{(i)} = W^{(i)} a^{(i-1)} + b^{(i)}$ (here the superscript $(i)$ indexes the layer rather than a training example). When we calculate $\tilde{z}^{(i)}$, we first normalize $z^{(i)}$ by subtracting the mean, so the value of $b^{(i)}$ has no influence on the result and can be dropped.

$\Rightarrow$ In each layer, we have three parameters: $W^{(i)}$, $\gamma^{(i)}$, $\beta^{(i)}$.

Frequently, we combine this with mini-batch gradient descent; i.e., we train $W^{(i)}$, $\gamma^{(i)}$, $\beta^{(i)}$ with respect to $X^{\{1\}}, \dots, X^{\{n\}}$, where $X^{\{i\}}$ is the $i^{th}$ mini-batch of the training data.
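To make these steps concrete, here is a minimal NumPy sketch of one forward pass through a layer with Batch Norm. The function name `batchnorm_forward`, the layer sizes, and the choice of ReLU for $g$ are illustrative assumptions, not part of the original card.

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-5):
    """Batch Norm over pre-activations z of shape (n_units, batch_size)."""
    mu = z.mean(axis=1, keepdims=True)        # per-unit mean over the mini-batch
    var = z.var(axis=1, keepdims=True)        # per-unit variance over the mini-batch
    z_norm = (z - mu) / np.sqrt(var + eps)    # normalize to mean 0, variance 1
    return gamma * z_norm + beta              # learnable scale and shift

# One layer with Batch Norm. Note there is no bias b: subtracting the
# batch mean would cancel it exactly, and beta plays its role instead.
rng = np.random.default_rng(0)
a_prev = rng.normal(size=(4, 32))             # activations from previous layer, batch of 32
W = rng.normal(size=(3, 4))                   # weights for a layer with 3 units
gamma = np.ones((3, 1))                       # one scale per unit
beta = np.zeros((3, 1))                       # one shift per unit

z = W @ a_prev                                # z = W a^{(l-1)}  (no + b term)
a = np.maximum(0.0, batchnorm_forward(z, gamma, beta))  # ReLU as g
```

Because the batch mean is subtracted from every example's $z$, any constant bias added before normalization would cancel out, which is why the sketch computes $z$ without a $+\,b$ term and leaves the shifting to $\beta$.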


Updated 2020-11-30

Tags

Data Science