Back Propagation Example

The input to a 2-layer MLP (Multi-Layer Perceptron) is given as $X$, and the output is given as $Y$.

Thus, there are two parameter matrices, $W^{(1)}$ and $W^{(2)}$, for layers 1 and 2 respectively.

Layer 1 applies a ReLU nonlinearity, so the hidden output of layer 1 is $H = \max\{0, XW^{(1)}\}$.

The net (total) cost function $J$ is given by the cross-entropy cost $J_{MLE}$ plus a weight-decay regularization term $\lambda\left(\sum_{ij}(W^{(1)}_{ij})^2 + \sum_{ij}(W^{(2)}_{ij})^2\right)$:

$$J = J_{MLE} + \lambda\left(\sum_{ij}(W^{(1)}_{ij})^2 + \sum_{ij}(W^{(2)}_{ij})^2\right)$$
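A minimal NumPy sketch of the forward computation may make the setup concrete. The shapes, the `forward` name, and the choice of softmax cross-entropy (averaged over the minibatch) for $J_{MLE}$ are assumptions for illustration, not details from the note.

```python
import numpy as np

# Forward-pass sketch for the 2-layer MLP and total cost J.
# Assumed shapes: X is (n, d), Y is (n, k) one-hot targets,
# W1 is (d, m), W2 is (m, k); lam plays the role of lambda.
def forward(X, Y, W1, W2, lam):
    H = np.maximum(0.0, X @ W1)                        # H = max{0, X W1} (ReLU)
    U2 = H @ W2                                        # unnormalized log probabilities
    # numerically stable log-softmax, then cross-entropy averaged over the minibatch
    U2s = U2 - U2.max(axis=1, keepdims=True)
    log_probs = U2s - np.log(np.exp(U2s).sum(axis=1, keepdims=True))
    J_mle = -np.mean(np.sum(Y * log_probs, axis=1))    # cross-entropy cost J_MLE
    J = J_mle + lam * (np.sum(W1**2) + np.sum(W2**2))  # add the weight-decay term
    return J, J_mle, H, U2
```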

This setup produces the computational graph shown in the image below.

Compute $\nabla_{W^{(1)}} J$ and $\nabla_{W^{(2)}} J$.

Backpropagation in this example is simple on the weight-decay side: the gradient of the regularization term with respect to each weight matrix is just $2\lambda W^{(i)}$. The cross-entropy side is less immediate, because its gradient has to be propagated back through the network.

Let $G = \nabla_{U^{(2)}} J_{MLE}$, the gradient of the cross-entropy cost with respect to the unnormalized layer-2 output $U^{(2)} = HW^{(2)}$ (the node feeding the cross-entropy in the computational graph).
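As a concrete (assumed) instance: if $J_{MLE}$ is the softmax cross-entropy between the one-hot targets $Y$ and $U^{(2)}$, averaged over an $n$-example minibatch, then

$$G = \nabla_{U^{(2)}} J_{MLE} = \frac{1}{n}\left(\operatorname{softmax}(U^{(2)}) - Y\right),$$

with the softmax applied row-wise.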

Gradient 1: $g_1 = H^T G$ (the cross-entropy contribution to the gradient on $W^{(2)}$)

Gradient 2: $g_2 = \nabla_{H} J = G W^{(2)T}$ (the gradient on the hidden output $H$)

Gradient 3: $g_3 = \mathrm{back\_prop\_relu}(H, g_2)$ (zero out the entries of $g_2$ where the ReLU was inactive, i.e., where $H = 0$)

Gradient 4: $g_4 = X^T g_3$ (the cross-entropy contribution to the gradient on $W^{(1)}$)

Add $g_4$ and $g_1$ to the weight-decay gradients of $W^{(1)}$ and $W^{(2)}$ respectively (the $2\lambda W^{(i)}$ terms plus the back-propagated cross-entropy gradients). This gives the answers:

$\nabla_{W^{(1)}} J = X^T g_3 + 2\lambda W^{(1)}$ and $\nabla_{W^{(2)}} J = H^T G + 2\lambda W^{(2)}$.
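The backward-pass sketch below mirrors gradients 1 through 4 under the same assumptions as the forward-pass sketch (softmax cross-entropy averaged over an $n$-example minibatch, one-hot $Y$); the function and variable names are illustrative only.

```python
import numpy as np

# Backward pass: the four gradients above plus the weight-decay terms.
def backward(X, Y, W1, W2, lam, H, U2):
    n = X.shape[0]
    # G = gradient of J_MLE w.r.t. U2 (softmax cross-entropy case, assumed)
    U2s = U2 - U2.max(axis=1, keepdims=True)
    probs = np.exp(U2s) / np.exp(U2s).sum(axis=1, keepdims=True)
    G = (probs - Y) / n

    g1 = H.T @ G                   # Gradient 1: cross-entropy gradient on W2
    g2 = G @ W2.T                  # Gradient 2: gradient on H
    g3 = g2 * (H > 0)              # Gradient 3: back_prop_relu(H, g2)
    g4 = X.T @ g3                  # Gradient 4: cross-entropy gradient on W1

    grad_W1 = g4 + 2.0 * lam * W1  # gradient of J w.r.t. W1
    grad_W2 = g1 + 2.0 * lam * W2  # gradient of J w.r.t. W2
    return grad_W1, grad_W2
```

A quick finite-difference check against the `forward` sketch on small random matrices is an easy way to validate these expressions.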

[Image: computational graph for the 2-layer MLP]

Updated 2021-06-17

Tags

Data Science
