Formula

Gradient of Objective Function with Respect to Hidden Layer Output

To continue backpropagation toward the input layer, we calculate the gradient of the objective function $J$ with respect to the hidden layer output vector $\mathbf{h} \in \mathbb{R}^h$. Applying the chain rule through the output layer variable $\mathbf{o}$ yields:

$$\frac{\partial J}{\partial \mathbf{h}} = \textrm{prod}\left(\frac{\partial J}{\partial \mathbf{o}}, \frac{\partial \mathbf{o}}{\partial \mathbf{h}}\right) = {\mathbf{W}^{(2)}}^\top \frac{\partial J}{\partial \mathbf{o}}$$

Since $\mathbf{o} = \mathbf{W}^{(2)} \mathbf{h}$, the Jacobian $\partial \mathbf{o} / \partial \mathbf{h}$ is simply $\mathbf{W}^{(2)}$, so this operation propagates the error gradient one layer backward by multiplying it by the transpose of the output layer's weight matrix.
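To make the shape bookkeeping and the role of the transpose concrete, here is a minimal NumPy sketch of this step; the dimensions, variable names, and random values are illustrative assumptions, not from the source.

```python
import numpy as np

# Illustrative sizes: h = 4 hidden units, q = 3 outputs (assumed, not from the source).
h, q = 4, 3

W2 = np.random.randn(q, h)   # output-layer weights, so o = W2 @ h_vec
dJ_do = np.random.randn(q)   # upstream gradient dJ/do, assumed already computed

# Since o = W2 @ h_vec, the Jacobian do/dh is W2 itself; the chain rule
# therefore gives dJ/dh = W2.T @ dJ/do, mapping the gradient back one layer.
dJ_dh = W2.T @ dJ_do

print(dJ_dh.shape)  # (4,) -- matches the hidden layer's dimensionality
```

Note how the transpose converts a length-$q$ gradient at the output into a length-$h$ gradient at the hidden layer, which is exactly what the next backward step needs.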
