Concept

Vector-Jacobian Product via the Gradient Argument

Deep learning frameworks differ in how they handle gradients of non-scalar tensors. In PyTorch, invoking automatic differentiation on a non-scalar output raises an error unless a reduction vector $\mathbf{v}$ is provided. This vector is passed through the `gradient` argument, instructing the framework to compute the vector-Jacobian product $\mathbf{v}^\top \partial_{\mathbf{x}} \mathbf{y}$ rather than the full Jacobian matrix $\partial_{\mathbf{x}} \mathbf{y}$.
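As a minimal sketch of this behavior, the snippet below (an illustrative example, not from the original post) calls `backward()` on a non-scalar output with an all-ones vector passed as the `gradient` argument:

```python
import torch

x = torch.arange(4.0, requires_grad=True)
y = x * x  # non-scalar output: calling y.backward() with no argument raises
           # "RuntimeError: grad can be implicitly created only for scalar outputs"

v = torch.ones_like(y)   # the reduction vector v
y.backward(gradient=v)   # computes v^T (dy/dx), not the full Jacobian

# For elementwise y = x^2 the Jacobian is diag(2x), so with v = 1
# the vector-Jacobian product reduces to 2x.
print(x.grad)  # tensor([0., 2., 4., 6.])
```

Choosing `v = torch.ones_like(y)` makes the result equal to the gradient of `y.sum()`, which is the most common use of the `gradient` argument in practice.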

Updated 2026-05-02


Dive into Deep Learning @ D2L