Activation function

If there is no activation function, the whole network collapses to a single linear layer, so non-linearity is essential in a neural network.

$$ \forall A_i \ \exists B \mid A_n \times A_{n-1} \times \cdots \times A_1 \times X = B \times X $$
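A quick numerical check of this collapse (a minimal NumPy sketch; the layer shapes and variable names are arbitrary):

```python
import numpy as np

# Three "layers" that are only matrix multiplications.
# With no non-linearity between them, the composition is a single
# matrix B = A3 @ A2 @ A1, so the stack adds no expressive power.
rng = np.random.default_rng(0)
A1 = rng.standard_normal((4, 3))
A2 = rng.standard_normal((5, 4))
A3 = rng.standard_normal((2, 5))
X = rng.standard_normal((3, 10))      # a batch of 10 inputs as columns

deep_output = A3 @ (A2 @ (A1 @ X))    # "three-layer" network, no activations
B = A3 @ A2 @ A1                      # the equivalent single layer
single_output = B @ X

print(np.allclose(deep_output, single_output))  # True
```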


Sigmoid

$$ \sigma(x) = \frac{1}{1+e^{-x}} \\ \text{Range} \in (0, 1) $$

https://www.desmos.com/calculator/bi07ig0wes

✖️ The gradient is nearly zero for very large positive or negative inputs, so saturated neurons kill the gradient in backpropagation and learning for those weights becomes very slow or stops entirely.
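A small sketch of this saturation effect (NumPy; the function names are my own), printing the local gradient $\sigma'(x) = \sigma(x)(1-\sigma(x))$ at a few points:

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Local gradient sigma'(x) = sigma(x) * (1 - sigma(x)); its maximum is 0.25 at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x = {x:+5.1f}  sigma = {sigmoid(x):.5f}  grad = {sigmoid_grad(x):.6f}")
# At x = +/-10 the local gradient is about 4.5e-5, so almost no signal
# flows backward through a saturated neuron.
```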

✖️ Sigmoid outputs are not zero-centered. Since every input to the next layer is then positive, the local gradients on its weights W all share the same sign, so updates can only move in all-positive or all-negative directions, which causes zig-zag updates and slower convergence (see the sketch below).

(figure: zig-zag movement of the weight updates when every gradient component shares the same sign)
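A minimal sketch of why this happens (variable names are mine): if a neuron computes $z = w \cdot x$ and $x$ comes from a sigmoid layer, every component of $x$ is positive, so $\partial L / \partial w = (\partial L / \partial z)\,x$ has the same sign in every component.

```python
import numpy as np

rng = np.random.default_rng(0)
# Inputs coming from a previous sigmoid layer: all components lie in (0, 1).
x = 1.0 / (1.0 + np.exp(-rng.standard_normal(5)))

for upstream in (+1.3, -0.7):          # two possible values of dL/dz
    grad_w = upstream * x              # chain rule: dL/dw = dL/dz * x
    print(upstream, np.sign(grad_w))   # all +1 or all -1, never mixed
# The update on w can only point into the all-positive or all-negative
# orthant, which forces the zig-zag path sketched above.
```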

✖️ Computing the exponential $e^{-x}$ is relatively expensive.

tanh

$$ \tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}} = \frac{e^{2x}-1}{e^{2x}+1} \\ \text{Range} \in (-1, 1) $$

https://www.desmos.com/calculator/zqoehg93q6

$\tanh(x)$ is a scaled and shifted version of $\sigma(x)$.

$$ \tanh(x) = 2\sigma(2x)-1 $$
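This identity is easy to verify numerically (a minimal NumPy check; `sigmoid` is defined as in the sketch above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 101)
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))  # True
```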

✔️ Outputs are zero-centered.

✖️ Saturated neurons still kill the gradient.

ReLU (Rectified Linear Unit)