Formula

Gradient of Objective Function with Respect to Output Layer Weights

The gradient of the regularized objective function $J$ with respect to the model parameters closest to the output layer, $\mathbf{W}^{(2)} \in \mathbb{R}^{q \times h}$, is calculated using the chain rule. It combines the gradient propagated through the output layer variable $\mathbf{o}$ with the explicit gradient of the regularization term $s$:

$$\frac{\partial J}{\partial \mathbf{W}^{(2)}} = \textrm{prod}\left(\frac{\partial J}{\partial \mathbf{o}}, \frac{\partial \mathbf{o}}{\partial \mathbf{W}^{(2)}}\right) + \textrm{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial \mathbf{W}^{(2)}}\right) = \frac{\partial J}{\partial \mathbf{o}} \mathbf{h}^\top + \lambda \mathbf{W}^{(2)},$$

where $\mathbf{h}^\top$ is the transpose of the hidden layer activation vector. Since the $\ell_2$ regularization term is $s = \frac{\lambda}{2}\left(\|\mathbf{W}^{(1)}\|_F^2 + \|\mathbf{W}^{(2)}\|_F^2\right)$ and $\partial J/\partial s = 1$, the second summand reduces to $\lambda \mathbf{W}^{(2)}$.
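As a concrete check of this formula, here is a minimal NumPy sketch. The sizes ($d = 5$, $h = 4$, $q = 3$), the ReLU activation, the squared-error loss, and $\lambda = 0.01$ are all illustrative assumptions, not fixed by the text; the point is only that the closed-form gradient $\frac{\partial J}{\partial \mathbf{o}} \mathbf{h}^\top + \lambda \mathbf{W}^{(2)}$ matches a finite-difference approximation of $\partial J / \partial \mathbf{W}^{(2)}$.

```python
import numpy as np

# Minimal sketch (assumed sizes, activation, and loss): verify
#   dJ/dW2 = (dJ/do) h^T + lambda * W2
# for a one-hidden-layer net with ReLU, squared-error loss, and
# L2 penalty s = (lambda/2) * ||W2||_F^2. The W1 part of s is constant
# with respect to W2, so it is omitted from the objective below.

rng = np.random.default_rng(0)
d, h_dim, q, lam = 5, 4, 3, 0.01     # input size d, hidden size h, output size q, weight decay

x = rng.normal(size=(d,))            # input example
W1 = rng.normal(size=(h_dim, d))     # hidden-layer weights W^(1)
W2 = rng.normal(size=(q, h_dim))     # output-layer weights W^(2) in R^{q x h}
y = rng.normal(size=(q,))            # target

h = np.maximum(W1 @ x, 0.0)          # hidden activation h = phi(W1 x)
o = W2 @ h                           # output layer variable o = W2 h
dJ_do = o - y                        # dJ/do for the loss 1/2 * ||o - y||^2

# Closed-form gradient from the formula above: (dJ/do) h^T + lambda * W2.
grad_formula = np.outer(dJ_do, h) + lam * W2

def objective(W2_):
    """J(W2) = squared-error loss plus the W2 part of the L2 penalty."""
    o_ = W2_ @ h
    return 0.5 * np.sum((o_ - y) ** 2) + 0.5 * lam * np.sum(W2_ ** 2)

# Central finite differences, entry by entry.
eps = 1e-6
grad_fd = np.zeros_like(W2)
for i in range(q):
    for j in range(h_dim):
        Wp, Wm = W2.copy(), W2.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        grad_fd[i, j] = (objective(Wp) - objective(Wm)) / (2 * eps)

print(np.max(np.abs(grad_formula - grad_fd)))  # tiny (~1e-10): the gradients agree
```

The printed discrepancy should be on the order of $10^{-10}$, confirming numerically that the two $\textrm{prod}$ terms collapse to the outer product $\frac{\partial J}{\partial \mathbf{o}} \mathbf{h}^\top$ plus the weight-decay term $\lambda \mathbf{W}^{(2)}$.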

Tags: D2L, Dive into Deep Learning @ D2L