Formula

Gradient of Objective Function with Respect to Output Layer Variable

To compute the gradient of the objective function $J$ with respect to the output-layer variable $\mathbf{o} \in \mathbb{R}^q$, the chain rule is applied through the loss term $L$. Because the gradient of $J$ with respect to $L$ is $1$, the formula simplifies directly to the partial derivative of the loss with respect to the output:

$$\frac{\partial J}{\partial \mathbf{o}} = \textrm{prod}\left(\frac{\partial J}{\partial L}, \frac{\partial L}{\partial \mathbf{o}}\right) = \frac{\partial L}{\partial \mathbf{o}}$$
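A minimal sketch of this identity, assuming a squared-error loss $L(\mathbf{o}, \mathbf{y}) = \tfrac{1}{2}\|\mathbf{o} - \mathbf{y}\|^2$ and an objective $J = L + s$ where the regularization term $s$ does not depend on $\mathbf{o}$ (both choices are illustrative, not taken from the source). Since $s$ is constant in $\mathbf{o}$ and $\partial J / \partial L = 1$, the analytic gradient $\partial J / \partial \mathbf{o} = \partial L / \partial \mathbf{o} = \mathbf{o} - \mathbf{y}$, which a finite-difference check on $J$ confirms:

```python
import numpy as np

def loss(o, y):
    """Squared-error loss L = 0.5 * ||o - y||^2 (assumed for illustration)."""
    return 0.5 * np.sum((o - y) ** 2)

def objective(o, y, w, lam=0.01):
    """J = L + s, with an L2 weight penalty s that does not depend on o."""
    return loss(o, y) + 0.5 * lam * np.sum(w ** 2)

def grad_J_wrt_o(o, y):
    # dJ/do = (dJ/dL) * (dL/do) = 1 * (o - y) for squared-error loss.
    return o - y

rng = np.random.default_rng(0)
o, y = rng.normal(size=4), rng.normal(size=4)
w = rng.normal(size=3)

# Central-difference approximation of dJ/do, component by component.
eps = 1e-6
numeric = np.array([
    (objective(o + eps * e, y, w) - objective(o - eps * e, y, w)) / (2 * eps)
    for e in np.eye(4)
])
print(np.allclose(numeric, grad_J_wrt_o(o, y), atol=1e-5))
```

The numerical gradient of the full objective $J$ matches the analytic gradient of $L$ alone, illustrating that the regularization term drops out of $\partial J / \partial \mathbf{o}$.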

Updated 2026-05-06

Dive into Deep Learning @ D2L