
Radial Kernel in Support Vector Machines

Support vector machines are a well-known and very powerful classification technique. Unlike many other classifiers, they do not rely on a probabilistic model; instead they generate hyperplanes (simply put, lines or planes) that separate the data in some feature space into different regions and thereby classify it.

Support vector classifiers are mostly used for binary classification problems, where we have only two class labels, say Y = {−1, 1}, and a set of predictors Xi. What the SVM does is generate separating boundaries (hyperplanes), which in simple terms can be straight lines, planes, or non-linear curves, and these boundaries are used to divide the data into two or more categories depending on the type of classification problem.


Another important concept in SVMs is the maximal margin classifier. Among all separating hyperplanes, the SVM looks for the one that maximizes the margin M, i.e. the gap between the two classes and the decision boundary (the separating plane). When the data can be divided into two classes by such a straight, linear separator, we say the data are linearly separable.
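For reference, here is a sketch of the maximal margin problem in the usual textbook notation (this formulation is not spelled out in the original post): we choose the coefficients β and the margin M so that every training point lies on the correct side of the boundary, at distance at least M from it.

\max_{\beta_0, \beta_1, \dots, \beta_p, \, M} \; M
\quad \text{subject to} \quad
\sum_{j=1}^{p} \beta_j^2 = 1, \qquad
y_i \Big( \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} \Big) \ge M \;\; \text{for every training point } i.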


The term support vectors refers to the data points, or training examples, that are used to define and maximize the margin. The support vectors are the points that lie close to the decision boundary or on the wrong side of it.


It is often the case, however, that a linear separator fails: a linear decision boundary cannot capture the non-linear interactions in the data or the non-linear dependence between the features in feature space.

In this tutorial, I am going to talk about generating non-linear decision boundaries that can separate such non-linear data, using a radial kernel support vector classifier.

So how do we separate non-linear data?


The trick here is feature expansion. We solve the problem by applying a non-linear transformation to the features Xi and mapping them into a higher-dimensional space (say, from 2-D to 3-D), called a feature space. A boundary that is linear in this enlarged space corresponds to a non-linear decision boundary in the original space, and that is what lets us separate non-linear data.

One famous and very general way of adding non-linearity to a model, so that it can capture non-linear interactions, is simply to use higher-degree terms such as squared and cubic polynomial terms, i.e. to fit on the enlarged feature set X1, X1^2, X2, X2^2, …, Xp, Xp^2.

\beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_2 + \beta_4 X_2^2 + \dots = 0

The above equation is the non-linear hyperplane (decision boundary) we obtain if we use such higher-degree polynomial terms to fit the data.

What we are actually doing is fitting an SVM in an enlarged space: we enlarge the space of features by including non-linear functions of the predictors Xi. The problem with polynomials is that in high dimensions, i.e. when there are many predictors, the expansion becomes unwieldy and generally over-fits at higher degrees of the polynomial; a small sketch with a polynomial kernel follows below.
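As a rough illustration (not taken from the original article), here is how a polynomial-kernel fit might look with the R package e1071 mentioned in the end note; the toy data set below is made up purely for the example.

library(e1071)

set.seed(1)
# hypothetical toy data: two classes that are not linearly separable
x <- matrix(rnorm(200 * 2), ncol = 2)
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, 1, -1))
dat <- data.frame(x = x, y = y)

# degree-2 polynomial kernel: behaves like a linear SVM fitted on an
# enlarged feature space containing squares and pairwise products
fit_poly <- svm(y ~ ., data = dat, kernel = "polynomial", degree = 2, cost = 1)

# a much higher degree tends to give a wild, over-fit boundary
fit_poly10 <- svm(y ~ ., data = dat, kernel = "polynomial", degree = 10, cost = 1)

plot(fit_poly, dat)     # compare the two fitted boundaries
plot(fit_poly10, dat)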


Hence there is another, more elegant way of adding non-linearity to an SVM: the kernel trick.


Kernel function

A kernel function is a function of the form

K(x_i, x_{i'}) = \Big( 1 + \sum_{j=1}^{p} x_{ij} x_{i'j} \Big)^d ,

where d is the degree of the polynomial; this particular kernel is known as the polynomial kernel.


Now, the type of kernel function we are going to use here is the radial kernel. It has the form

K(x_i, x_{i'}) = \exp\Big( -\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \Big) ,

where γ (gamma) is a tuning parameter that accounts for the smoothness of the decision boundary and controls the variance of the model.

If γ is very large, we get fluctuating, wiggly decision boundaries, which means high variance and over-fitting. If γ is small, the decision boundary is smoother and has lower variance.
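To make the effect of γ concrete, here is a hedged sketch using e1071 again; the data and the two γ values are invented for the illustration.

library(e1071)

set.seed(1)
# hypothetical toy data: class depends on the distance from the origin
x <- matrix(rnorm(200 * 2), ncol = 2)
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, 1, -1))
dat <- data.frame(x = x, y = y)

# small gamma: smooth boundary, lower variance
fit_smooth <- svm(y ~ ., data = dat, kernel = "radial", gamma = 0.5, cost = 1)

# large gamma: wiggly boundary, higher variance, likely over-fitting
fit_wiggly <- svm(y ~ ., data = dat, kernel = "radial", gamma = 50, cost = 1)

plot(fit_smooth, dat)   # visibly smoother decision region
plot(fit_wiggly, dat)   # small islands around individual training points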


So now the equation of the support vector classifier becomes

f(x) = \beta_0 + \sum_{i \in S} \alpha_i K(x, x_i) ,

where S is the set of support vectors and α_i is a weight that is non-zero for the support vectors and zero for every other training point.
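To see that only the support vectors enter this sum, here is a sketch (again with e1071, on made-up data) that rebuilds the decision value of a new point from the fitted pieces; in e1071 the fitted object stores the support vectors in $SV, the weights (with the class sign folded in) in $coefs, and the intercept as the negative of $rho.

library(e1071)

set.seed(1)
x <- matrix(rnorm(200 * 2), ncol = 2)
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, 1, -1))
dat <- data.frame(x = x, y = y)

gam <- 1
fit <- svm(y ~ ., data = dat, kernel = "radial",
           gamma = gam, cost = 1, scale = FALSE)

# f(x) = sum over support vectors of coef_i * K(x, x_i), minus rho
rbf <- function(u, v, g) exp(-g * sum((u - v)^2))
x_new <- c(0.5, -0.2)
k_vals <- apply(fit$SV, 1, rbf, v = x_new, g = gam)
f_manual <- sum(fit$coefs * k_vals) - fit$rho

# should agree with the decision value that predict() reports
f_pkg <- attr(predict(fit, data.frame(x.1 = x_new[1], x.2 = x_new[2]),
                      decision.values = TRUE), "decision.values")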


Conclusion

Radial kernel support vector machines are a good approach when the data are not linearly separable. The idea behind generating non-linear decision boundaries is to apply a non-linear transformation to the features Xi that maps them into a higher-dimensional space, and the kernel trick lets us do this implicitly. The performance of the SVM is influenced by two tuning parameters: the regularization (cost) parameter C and γ. We can use cross-validation to find the values of these tuning parameters that give the best classifier performance. Another way of finding good values for these hyper-parameters is to use optimization techniques such as Bayesian optimization.
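One possible way to do that cross-validation in practice is e1071's tune() helper; the sketch below uses invented data, and the grids of cost and gamma values are arbitrary.

library(e1071)

set.seed(1)
x <- matrix(rnorm(200 * 2), ncol = 2)
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, 1, -1))
dat <- data.frame(x = x, y = y)

# 10-fold cross-validation over a grid of cost (C) and gamma values
tune_out <- tune(svm, y ~ ., data = dat, kernel = "radial",
                 ranges = list(cost = c(0.1, 1, 10, 100),
                               gamma = c(0.5, 1, 2, 4)))

summary(tune_out)                # CV error for every (cost, gamma) pair
best_fit <- tune_out$best.model  # classifier refit with the best pair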


End Note

The section above is adapted from Anish's article at the following link. It is a gentle introduction to support vector machines that does not dig too deep into the mathematics behind the concept but rather explains how it works in general. I also created a tutorial on Kaggle to illustrate how an SVM is deployed for classification using the R library "e1071". Let's have a look here!








