Learn Before
Relationship between Swish Function and other Activation Functions
For β=1, the function becomes equivalent to the Sigmoid-weighted Linear Unit (SiL) function used in reinforcement learning, whereas for β=0, the function turns into the scaled linear function f(x)=x/2. As β→∞, the sigmoid component approaches a 0-1 step function, so Swish behaves like the ReLU function. Thus, Swish can be viewed as a smooth function that nonlinearly interpolates between the linear function and ReLU.
Tags
Data Science
Related
Relationship between Swish Function and other Activation Functions
Consider the function defined as f(x) = x / (1 + e^(-βx)), where β is a positive parameter. Analyze the behavior of this function as the parameter β becomes extremely large (i.e., approaches infinity). Which of the following statements best describes the resulting function's behavior?
Ramachandran et al. [2017] on the Swish Function
Analysis of Swish Function Behavior
Evaluating Activation Function Properties