Concept

Implicit Gradient Summation in Deep Learning Frameworks

When differentiating non-scalar outputs, deep learning frameworks handle gradient reduction differently. TensorFlow and MXNet implicitly reduce the output tensor to a scalar by summing all elements of the output vector $\mathbf{y}$, calculating the gradient of that sum to return $\partial_{\mathbf{x}} \sum_i y_i$ instead of the full Jacobian matrix $\partial_{\mathbf{x}} \mathbf{y}$. In contrast, JAX does not perform implicit summation; its grad function is strictly defined for scalar outputs, meaning the user must explicitly construct a scalar-valued function (such as calling .sum()) prior to applying the gradient operation.
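A minimal sketch of this contrast, assuming TensorFlow and JAX are installed; the quadratic function y = x * x and the input values are illustrative choices, not taken from the original text:

```python
import tensorflow as tf
import jax
import jax.numpy as jnp

# TensorFlow: GradientTape.gradient() on a non-scalar target implicitly
# sums the elements of y, returning d(sum_i y_i)/dx, not the full Jacobian.
x_tf = tf.range(4, dtype=tf.float32)
with tf.GradientTape() as tape:
    tape.watch(x_tf)           # x_tf is a plain tensor, so watch it explicitly
    y = x_tf * x_tf            # vector-valued output
print(tape.gradient(y, x_tf))  # [0. 2. 4. 6.], i.e. 2x

# JAX: jax.grad is defined only for scalar-valued functions, so the
# reduction to a scalar must be written out explicitly with .sum().
x_jax = jnp.arange(4.0)
print(jax.grad(lambda x: (x * x).sum())(x_jax))  # [0. 2. 4. 6.], i.e. 2x
```

Calling jax.grad on a function that returns a vector raises an error rather than summing silently, which is exactly the behavioral difference described above.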


Updated 2026-05-02

Tags

D2L

Dive into Deep Learning @ D2L