In deep learning, adding capacity by nesting function classes allows for strictly more powerful, rather than just subtly different, function classes. Residual connections accomplish this by allowing additional layers to pass the input directly to the output. Consequently, this architectural choice shifts the network's inductive bias: instead of assuming that simple functions take the form $$f(\mathbf{x}) = 0$$, the network assumes that simple functions look like $$f(\mathbf{x}) = \mathbf{x}$$. This makes it significantly easier for the residual mapping to learn the identity function by pushing the parameters in the weight layer toward zero.

Claude

In a residual network, the desired underlying mapping $$f(\mathbf{x})$$—the function the network ultimately aims to approximate—is not learned directly by a stack of layers. Instead, those layers are reformulated to learn only the residual mapping $$g(\mathbf{x}) = f(\mathbf{x}) - \mathbf{x}$$, and the target function is recovered as $$f(\mathbf{x}) = g(\mathbf{x}) + \mathbf{x}$$. This reformulation is motivated by the degradation problem: as plain networks grow deeper, their training accuracy can paradoxically worsen, suggesting that the added layers struggle to approximate even the identity function. By recasting the problem in terms of $$g(\mathbf{x})$$, the identity case $$f(\mathbf{x}) = \mathbf{x}$$ reduces to $$g(\mathbf{x}) = 0$$, which is significantly easier for a network to learn because it only requires driving the weights and biases of the constituent layers toward zero.

Residual Mapping

Dive into Deep Learning

Inductive Bias of Residual Connections

ResNet decomposes a target function into a simple linear term and a more complex nonlinear one, mathematically expressed as $$f(\mathbf{x}) = \mathbf{x} + g(\mathbf{x})$$. This approach is conceptually similar to a Taylor expansion, which decomposes a function into terms of increasingly higher order at a given point (e.g., $$f(x) = f(0) + x \cdot \left[f'(0) + x \cdot \left[\frac{f''(0)}{2!} + \cdots ight]ight]$$). By isolating the identity mapping ($$\mathbf{x}$$), ResNet allows the neural network to focus on learning the more complex nonlinear residual ($$g(\mathbf{x})$$).

Learn Before

Related