Learn Before
  • ReLU (Rectified Linear Unit)

Pros and Cons of ReLU

Pros:

  • Computationally efficient: just an element-wise max(0, x), which helps networks converge quickly
  • Non-linear: although it is built from two linear pieces, ReLU is a non-linear function with a usable derivative, so it supports backpropagation (see the sketch after this list)
  • If you're not sure which activation function to use for the hidden layers, ReLU is a good default
  • Geoffrey Hinton: allows a neuron to express a strong opinion
  • Gradient doesn't saturate (on the high end)
  • Less sensitive to random initialization
  • Runs great on low-precision hardware
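
As a quick illustration of the first two points, here is a minimal NumPy sketch (function names and values are my own, not from the original card) of ReLU and its gradient; note that the gradient stays at 1 for arbitrarily large inputs, which is the sense in which it does not saturate on the high end.

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), applied element-wise: one comparison per element.
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative is 1 for x > 0 and 0 for x < 0; at exactly 0 it is
    # undefined, and implementations conventionally pick 0 (as here) or 1.
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 1.5, 100.0])
print(relu(x))       # [  0.    0.    0.    1.5 100. ]  (no cap on large values)
print(relu_grad(x))  # [0. 0. 0. 1. 1.]                 (gradient does not shrink for large x)
```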

Cons:

  • The dying ReLU problem (dead neurons): for negative inputs the output is 0 and the gradient is also 0, so a neuron whose pre-activations stay negative receives no gradient during backpropagation and stops learning. => Solution: Leaky ReLU (see the sketch after this list)
  • Gradient discontinuous at the origin: at x = 0 the derivative is undefined, because it is the corner where the flat segment (slope 0) meets the linear segment (slope 1); implementations pick 0 or 1 there by convention, but the kink remains. => Solution: GELU, which is smooth everywhere
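
Below is a minimal NumPy sketch (my own illustrative code, not from the original card) of the two fixes named above: Leaky ReLU keeps a small non-zero slope for negative inputs so the gradient never vanishes there, and GELU (shown via its common tanh approximation) is smooth everywhere, including at 0.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Small slope alpha for x < 0, so negative-input neurons still get a gradient.
    return np.where(x > 0, x, alpha * x)

def gelu(x):
    # Widely used tanh approximation of GELU; differentiable everywhere.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(leaky_relu(x))  # [-0.03 -0.01  0.    1.    3.  ]
print(gelu(x))        # approx [-0.0036 -0.1588  0.      0.8412  2.9964]
```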

Tags

Data Science

Related
  • Pros and Cons of ReLU

  • Leaky ReLU

  • Parametric ReLU

  • Derivative of ReLU (Rectified Linear Unit) function

  • A common non-linear activation function is defined by the operation f(x) = max(0, x). If this function is applied element-wise to the input vector h = [2.7, -1.3, 0, -4.5, 8.1], what is the resulting output vector?

  • A neuron in a neural network computes a pre-activation value (the weighted sum of its inputs plus bias) of -2.8. The neuron then applies an activation function defined by the formula f(z) = max(0, z). Based on this, what will be the neuron's output, and what is the direct consequence for this neuron's learning process during backpropagation for this specific input?

  • A hidden layer in a neural network produces the following vector of pre-activation values for a single neuron across five different training examples: [-3.1, -0.5, 0.8, 2.4, 5.0]. An activation function defined as f(x) = max(0, x) is then applied to this vector. Which statement best analyzes the effect of this function on the information passed to the next layer?

Learn After
  • Why is it better to use ReLU by default?