If there is no activation function in the neural network, it can be reduced to a single-layer network, because composing linear layers just multiplies their weight matrices. So non-linearity is essential in a NN.
$$ \forall A_i \ \exists B \mid A_n \times A_{n-1} \times \cdots \times A_1 \times X = B \times X $$
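A quick NumPy sketch of this collapse (the shapes and names here are arbitrary, chosen only for illustration):

```python
import numpy as np

# Sketch: two linear layers with no activation in between collapse into one.
rng = np.random.default_rng(0)
A1 = rng.normal(size=(4, 3))    # first layer's weight matrix
A2 = rng.normal(size=(5, 4))    # second layer's weight matrix
X = rng.normal(size=(3, 10))    # a batch of 10 input column vectors

two_layer_output = A2 @ (A1 @ X)   # forward pass through both layers
B = A2 @ A1                        # the equivalent single weight matrix
print(np.allclose(two_layer_output, B @ X))   # True
```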
$$ \sigma(x) = \frac{1}{1+e^{-x}} \\ \text{Range} \in (0, 1) $$
https://www.desmos.com/calculator/bi07ig0wes
✖️ The sigmoid gradient is nearly zero for very large positive or negative inputs, so saturated neurons kill the gradient in backpropagation and learning slows down or stalls (vanishing gradients); see the sketch after this list.
✖️ Sigmoid outputs are not zero-centered, so the inputs to the next layer are all positive and the local gradients on W all share the same sign. The weight updates can then only move in a restricted set of directions, which makes convergence zig-zag and slow.
✖️ The exponential $e^{-x}$ is computationally expensive.
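A minimal NumPy sketch of the saturation problem (the helper names `sigmoid` and `sigmoid_grad` are mine, not from any library), using $\sigma'(x) = \sigma(x)(1-\sigma(x))$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Local gradient of the sigmoid: sigma(x) * (1 - sigma(x)), peaks at 0.25.
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.5f}  grad={sigmoid_grad(x):.2e}")
# At x = +/-10 the gradient is about 4.5e-05: almost nothing flows back
# through a saturated neuron during backpropagation.
```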
$$ \tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}} = \frac{e^{2x}-1}{e^{2x}+1} \\ \text{Range} \in (-1, 1) $$
https://www.desmos.com/calculator/zqoehg93q6
$\tanh(x)$ is a scaled and shifted version of $\sigma(x)$ (verified numerically in the sketch below):
$$ \tanh(x) = 2\sigma(2x)-1 $$
✔️ Zero-centered outputs.
✖️ Saturated neurons still kill the gradient.
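A small NumPy sketch (assuming nothing beyond NumPy) that checks the identity $\tanh(x) = 2\sigma(2x)-1$ and the two bullet points above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 101)
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))  # True: tanh(x) = 2*sigmoid(2x) - 1

print(np.tanh(0.0))              # 0.0 -> outputs are symmetric around zero (zero-centered)
print(1.0 - np.tanh(10.0)**2)    # ~8e-09 -> the gradient still vanishes for saturated neurons
```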