Concept

Gradient Reduction for Non-Scalar Outputs

While the Jacobian matrix provides the full derivative of a vector output, it is more common in machine learning to sum the gradients of each component of an output vector $\mathbf{y}$ with respect to the full input vector $\mathbf{x}$. This reduction yields a gradient vector that has the exact same shape as $\mathbf{x}$, a technique frequently used to aggregate gradients calculated individually for each training example in a batch.
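A minimal sketch of this reduction in PyTorch (the framework D2L builds on); the specific tensors here are illustrative, not from the source:

```python
import torch

# Input vector with gradient tracking enabled.
x = torch.arange(4.0, requires_grad=True)

# Non-scalar output: y has the same shape as x.
y = 2 * x * x

# Reduce y to a scalar by summing before backpropagating, so the
# resulting gradient has the same shape as x.
# d(sum(2 * x^2)) / dx_i = 4 * x_i for each component.
y.sum().backward()

print(x.grad)           # tensor([ 0.,  4.,  8., 12.])
print(x.grad == 4 * x)  # tensor([True, True, True, True])
```

Equivalently, `y.backward(gradient=torch.ones_like(y))` computes the same vector-Jacobian product: summing the components of $\mathbf{y}$ corresponds to multiplying the Jacobian by a vector of ones.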

Tags

D2L

Dive into Deep Learning @ D2L