Detaching Computation in Deep Learning Frameworks
Modern deep learning libraries provide built-in functions to detach a variable from the computational graph, stopping the backward flow of gradients through it. In PyTorch and MXNet, a tensor's computational history is discarded with the .detach() method. In JAX, the same effect is achieved by wrapping an operation in jax.lax.stop_gradient(), and TensorFlow provides tf.stop_gradient() for the same purpose.
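A minimal PyTorch sketch of this behavior (the variable names x, y, u, z are illustrative, not from the source): detaching y yields a tensor u with the same values but no gradient history, so downstream gradients treat u as a constant.

```python
import torch

x = torch.arange(4.0, requires_grad=True)
y = x * x            # y depends on x in the graph
u = y.detach()       # u has y's values but no computational history
z = u * x            # u is treated as a constant here
z.sum().backward()

# Because u was detached, dz/dx = u rather than 3 * x**2
assert torch.equal(x.grad, u)
```

Had y not been detached, backpropagating through z = y * x = x**3 would instead give a gradient of 3 * x**2.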
Updated 2026-05-02
Tags
D2L
Dive into Deep Learning @ D2L