Activation functions and when to use them

Patrick Ho
Jun 13, 2019

Activation functions are an important concept in machine learning, especially in deep learning. They decide whether a neuron should be activated and introduce a non-linear transformation into a neural network. Their main purpose is to convert the input signal of a neuron into an output that is fed to the neurons in the next layer. The following pictures show how an activation function works in a neural network.

There are many kinds of activation functions that can be used in neural networks, as well as in some machine learning algorithms such as logistic regression. In this article, I will explain some commonly used functions, namely Sigmoid, Tanh, ReLU, and Softmax, and introduce some useful cheat sheets that I have collected from multiple sources.

The first cheat sheet provides the derivative of each function. So why do we need the derivative (differentiation) here?

When updating the curve, we need to know in which direction and by how much to change it, and both depend on the slope. That is why we use differentiation in almost every part of machine learning and deep learning.

Basically, it means that we know how much a function changes when we change its input. The next cheat sheet provides some examples of models using activation functions. Thanks to Sebastian Raschka.
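To make the role of the slope concrete, here is a minimal sketch of gradient descent on a single parameter. The loss curve f(w) = (w - 3)^2, the learning rate, and the step count are all hypothetical choices made for illustration; they do not come from the cheat sheets.

```python
def f(w):
    # Hypothetical loss curve with its minimum at w = 3.
    return (w - 3.0) ** 2


def df(w):
    # Derivative (slope) of f at w: its sign gives the direction,
    # its magnitude says how steep the curve is at this point.
    return 2.0 * (w - 3.0)


def gradient_descent(w, learning_rate=0.1, steps=50):
    # Repeatedly move against the slope by a small step.
    for _ in range(steps):
        w = w - learning_rate * df(w)
    return w


w_final = gradient_descent(w=0.0)
print(w_final)  # ends up very close to the minimum at w = 3
```

At w = 0 the slope is negative, so the update increases w; the closer w gets to 3, the smaller the slope and therefore the smaller the step.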

1. Sigmoid (Logistic)

The sigmoid function takes a real-valued number and “squashes” it into the range between 0 and 1, i.e., σ(x) ∈ (0, 1). In particular, large negative numbers are mapped close to 0 and large positive numbers close to 1. Moreover, the sigmoid function has a nice interpretation as the firing rate of a neuron: from not firing at all (0) to fully saturated firing at an assumed maximum frequency (1). It is especially used for models that have to predict a probability as the output. Since a probability always lies between 0 and 1, sigmoid is a natural choice.
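As a rough illustration, the sigmoid σ(x) = 1 / (1 + e^(-x)) can be written in plain Python as below. The two-branch form is one common way to avoid overflow in math.exp for large negative inputs; it is a sketch, not the implementation used by any particular library.

```python
import math


def sigmoid(x):
    # Use the algebraically equivalent form for negative x so that
    # math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(x, sigmoid(x))
```

The printed values rise monotonically from near 0 to near 1, with sigmoid(0) equal to 0.5.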

Pros

It is nonlinear in nature. Combinations of this function are also nonlinear!
It gives an analog activation, unlike a step function.
It has a smooth gradient too.
It’s good for a classifier.
The output of the activation function is always in the range (0, 1), compared to (-inf, inf) for a linear function. So our activations are bound to a range, and they won’t blow up.

Cons

Towards either end of the sigmoid function, the Y values respond very little to changes in X.
This gives rise to the “vanishing gradients” problem.
Its output isn’t zero-centered. Since 0 < output < 1, the gradient updates can go too far in different directions, which makes optimization harder.
Sigmoids saturate and kill gradients.
The network refuses to learn further or learns drastically slowly (depending on the use case, and until the gradient computation hits floating-point limits).
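The saturation described above can be checked numerically. This sketch uses the identity σ'(x) = σ(x)(1 - σ(x)); the sample inputs and the depth of 10 layers are arbitrary choices for illustration.

```python
import math


def sigmoid(x):
    # Plain form; fine for the moderate inputs used here.
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    # Derivative of the sigmoid expressed through its own output.
    s = sigmoid(x)
    return s * (1.0 - s)


# The gradient peaks at x = 0 and shrinks quickly towards either end.
for x in (0.0, 2.0, 5.0, 10.0):
    print(x, sigmoid_grad(x))

# Backpropagation multiplies one such factor per sigmoid layer. Each
# factor is at most 0.25, so 10 layers contribute at most 0.25 ** 10.
upper_bound = 0.25 ** 10
print(upper_bound)
```

Because every factor is at most 0.25, the signal reaching the early layers of a deep sigmoid network shrinks geometrically with depth, which is the vanishing gradient problem in miniature.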

2. Tanh (Hyperbolic tangent)

This is an alternative to the Sigmoid function, and in practice the latter (Tanh) is usually the better version of the former. The Tanh function ranges from -1 to 1. Tanh is also sigmoidal (S-shaped) and is mainly used for classification between two classes.

Pros: The gradient is stronger for tanh than for sigmoid (the derivatives are steeper).
Cons: Tanh also has the vanishing gradient problem.
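Both points can be seen in a few lines of Python. This sketch relies on two standard identities, tanh'(x) = 1 - tanh(x)^2 and tanh(x) = 2σ(2x) - 1; the sample inputs are arbitrary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    # Derivative of tanh expressed through its own output.
    t = math.tanh(x)
    return 1.0 - t * t


# Steeper at the origin: 1.0 for tanh versus 0.25 for sigmoid.
print(tanh_grad(0.0), sigmoid_grad(0.0))

# Still saturating: the gradient is tiny for large |x|.
print(tanh_grad(5.0))

# Tanh is a rescaled and shifted sigmoid.
x = 0.7
print(math.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)
```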

3. ReLU (Rectified Linear Unit)

Generally speaking, this is the most widely used activation function in deep learning. The ReLU function is nonlinear, which means we can easily backpropagate the errors and have multiple layers of neurons activated by the ReLU function.

The range of ReLU is from 0 to ∞.

The main advantage of using the ReLU function over other activation functions is that it does not activate all the neurons at the same time. What does this mean? If the input to the ReLU function is negative, it is converted to zero and the neuron is not activated. This means that only a few neurons are activated at a time, which makes the network sparse and therefore efficient and easy to compute.
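A small sketch of this behaviour: ReLU is simply max(0, x), applied here to a made-up list of pre-activation values (the numbers are hypothetical and only serve to show the sparsity).

```python
def relu(x):
    # Negative inputs are clamped to zero; positive inputs pass through.
    return max(0.0, x)


# Hypothetical pre-activation values of five neurons in one layer.
pre_activations = [-2.0, -0.5, 0.0, 0.3, 1.7]
activations = [relu(v) for v in pre_activations]

# Count how many neurons actually fire.
active = sum(1 for a in activations if a > 0)
print(activations, active)
```

Only the neurons with a positive input produce a non-zero output, so the layer's activation vector is sparse.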

Pros

It mitigates the vanishing gradient problem, because its gradient does not saturate for positive inputs.
ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations.

Cons

One of its limitations is that it should only be used within the hidden layers of a neural network model.
Some units can be fragile during training and can die. A large gradient can cause a weight update that makes the neuron never activate on any data point again. Simply put, ReLU can result in dead neurons.
In other words, for activations in the region x < 0 of ReLU, the gradient is 0, so the weights are not adjusted during descent. That means the neurons that go into that state stop responding to variations in error or input (simply because the gradient is 0, nothing changes). This is called the dying ReLU problem.
The range of ReLU is [0, inf). This means it can blow up the activation.
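The dying ReLU problem can be sketched with a single neuron. The weight, bias, inputs, and upstream gradient below are hypothetical values chosen so that the pre-activation w * x + b is negative for every input.

```python
def relu_grad(z):
    # Gradient of ReLU with respect to its input: 1 for z > 0, else 0.
    return 1.0 if z > 0 else 0.0


# Hypothetical neuron whose pre-activation is negative on all inputs.
w, b = -1.0, -0.5
inputs = [0.2, 1.0, 3.0]
upstream = 1.0  # assumed gradient arriving from the next layer

# Chain rule: d(loss)/dw = upstream * relu'(w * x + b) * x
grads = [upstream * relu_grad(w * x + b) * x for x in inputs]
print(grads)
```

Every entry of grads is zero, so gradient descent never changes w: the neuron stays inactive on these inputs no matter how long training continues.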

4. Softmax

This is a generalization of the logistic activation function that is used for multiclass classification.

In mathematics, the softmax function, also known as softargmax or normalized exponential function, is a function that takes as input a vector of K real numbers and normalizes it into a probability distribution consisting of K probabilities. That is, prior to applying softmax, some vector components could be negative or greater than one, and they might not sum to 1; but after applying softmax, each component will be in the interval (0, 1), and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, larger input components correspond to larger probabilities. Softmax is often used in neural networks to map the non-normalized output of a network to a probability distribution over predicted output classes.
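The normalization described above can be sketched as follows. Subtracting the maximum before exponentiating is a common numerical-stability trick that leaves the result unchanged; the input vector is a made-up example.

```python
import math


def softmax(logits):
    # Shift by the maximum so that math.exp never overflows;
    # the shift cancels out in the ratio below.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw network outputs for three classes.
probs = softmax([2.0, 1.0, -1.0])
print(probs, sum(probs))
```

Each output lies strictly between 0 and 1, the outputs sum to 1 up to floating-point rounding, and the ordering of the inputs is preserved.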

In probability theory, the output of the Softmax function can be used to represent a categorical distribution, that is, a probability distribution over K different possible outcomes. In fact, it is the gradient-log-normalizer of the categorical probability distribution. Here is an example of a Softmax application.

Source: Isaac Changhau

Additional note:

Depending on the properties of the problem, we might be able to make a better choice for easier and quicker convergence of the network.

Sigmoid functions and their combinations generally work better in the case of classifiers.
Sigmoid and tanh functions are sometimes avoided due to the vanishing gradient problem.
The ReLU function is a general-purpose activation function and is used in most cases these days.
If we encounter dead neurons in our networks, the leaky ReLU function is a good choice.
Always keep in mind that the ReLU function should only be used in the hidden layers.
As a rule of thumb, you can begin with the ReLU function and then move to other activation functions if ReLU doesn’t give the optimum result.
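For reference, a minimal sketch of the leaky ReLU mentioned in the notes above. The slope alpha = 0.01 for negative inputs is a commonly used value but is an assumption here, not something taken from the sources of this article.

```python
def leaky_relu(x, alpha=0.01):
    # Like ReLU for positive inputs, but lets a small fraction of
    # negative inputs through instead of clamping them to zero.
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    # The gradient never becomes exactly zero, so a neuron stuck in
    # the negative region still receives (small) weight updates.
    return 1.0 if x > 0 else alpha


for x in (-2.0, 0.5, 3.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```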

Reference links:

https://isaacchanghau.github.io/post/activation_functions/

https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6

https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html

http://cs231n.github.io/neural-networks-1/

https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b
