SVM | Introduction to Support Vector Machines with Sklearn in Machine Learning
Support vector machines (SVMs) are supervised learning models that analyze data and recognize patterns. They are used for both classification and regression analysis. An SVM model represents the examples in the dataset as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible.
1. Maximal Margin Classifier
2. Support vector classifier
3. Support Vector Machines
4. Support Vector Machine for more than two classes
5. Support Vector Machine using Sklearn
Any new incoming data point is then mapped into the same space and assigned to one of these categories based on which side of the gap it falls on.
- Maximal Margin Classifier
- Support Vector Classifier
- Support Vector Machine
Maximal Margin Classifier
The maximal margin classifier is a model that classifies observations into two classes using a hyperplane.
What is a Hyperplane?
Simply put, a hyperplane is a subspace of $p$-dimensional space having $p - 1$ dimensions. For example, in two-dimensional space, the hyperplane has one dimension, that is, it is a line. Similarly, in three dimensions, it is a two-dimensional plane.
In two dimensions, the equation of the hyperplane is given by,
$\beta_0 + \beta_1X_1 + \beta_2X_2 = 0$
where the vector $(X_1, X_2)$ lies on the hyperplane.
This is simply the equation of a line, since a hyperplane in two dimensions is a line.
It’s fairly easy to extend this equation and find the equation of a hyperplane in $p$ dimensions.
$\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p = 0$
Now if,
$\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p > 0$
then the vector lies on one side of the hyperplane, and if,
$\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p < 0$
the vector lies on the other side of the hyperplane.
To sum up, our main aim with the maximal margin classifier is to fit a hyperplane to training data given as an $n \times p$ matrix $X$, containing $n$ training observations in $p$-dimensional space, such that each of these vectors falls into one of the two classes separated by the hyperplane.
We represent the classes (labels) of the $n$ observations as,
$y_1, ..., y_n \in \{-1, 1\}$
where -1 represents one class and 1 represents the other class.
For any incoming test vector,
$x^* = (x_1^*\ ...\ x_p^*)^T$
our model has to assign it to one of the two classes. The sign of the following function gives the class of the incoming test vector.
$f(x^*) = \beta_0 + \beta_1x_{1}^* + \beta_2x_{2}^* + ... + \beta_px_{p}^*$
If the value of this function is positive, we assign the test vector to class 1; otherwise, we assign it to class -1.
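The classification rule above can be sketched in a few lines of NumPy. The coefficients and test points below are made up for illustration, and the `classify` helper is a hypothetical name, not part of any library.

```python
import numpy as np

def classify(x_star, beta0, beta):
    """Return +1 or -1 depending on which side of the hyperplane x_star lies."""
    # f(x*) = beta_0 + beta_1 * x_1* + ... + beta_p * x_p*
    f = beta0 + np.dot(beta, x_star)
    return 1 if f > 0 else -1

# Hypothetical hyperplane in two dimensions: -3 + X1 + X2 = 0
beta0 = -3.0
beta = np.array([1.0, 1.0])

label_a = classify(np.array([2.5, 2.0]), beta0, beta)  # f = 1.5, positive side
label_b = classify(np.array([0.5, 1.0]), beta0, beta)  # f = -1.5, negative side
print(label_a, label_b)
```

In a real model the coefficients are not chosen by hand; they come out of the fitting procedure described in the rest of this section.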
One issue with this approach is that when the classes are perfectly separable, there are infinitely many hyperplanes that can separate them.
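To resolve this, the hyperplane with the largest margin, meaning the largest distance to the nearest training observations on both sides, is chosen as the separating hyperplane.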
However, the maximal margin classifier can overfit the training data when $p$ is large.
The training observations that lie on the edge of the margin (usually drawn as dashed lines) are called support vectors. The position of the hyperplane depends only on these support vectors and not on the other observations in the dataset.
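To see that only the support vectors matter, here is a minimal sketch on a made-up, linearly separable toy dataset. It assumes that scikit-learn's `SVC` with a linear kernel and a very large `C` is a reasonable stand-in for the maximal margin classifier. The model is fitted once on all points and once on the support vectors alone.

```python
import numpy as np
from sklearn.svm import SVC

# Small, linearly separable toy dataset (made up for illustration).
X = np.array([[0, 0], [0, 1], [1, 0], [-1, -1],
              [3, 3], [3, 4], [4, 3], [5, 5]], dtype=float)
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

# A very large C leaves almost no room for margin violations,
# which approximates the maximal margin classifier.
full = SVC(kernel="linear", C=1e6).fit(X, y)

# Refit using only the support vectors found by the first model.
sv = full.support_
reduced = SVC(kernel="linear", C=1e6).fit(X[sv], y[sv])

print("support vectors:\n", full.support_vectors_)
print("full fit:   ", full.coef_, full.intercept_)
print("reduced fit:", reduced.coef_, reduced.intercept_)
```

Both fits should report essentially the same coefficients and intercept, up to the solver's numerical tolerance.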
This is how we define a maximal margin classifier. It has a few issues.
- It doesn’t work when no hyperplane can cleanly separate the classes.
- Adding a single observation near the hyperplane can change the hyperplane substantially, which makes the classifier very sensitive to individual observations.
Support vector classifier
In a support vector classifier, we allow a few observations to be on the wrong side of the margin or even of the hyperplane. This makes the model more robust to individual observations and helps it classify most of the remaining observations better. The support vector classifier is also known as a soft margin classifier. The observations on the wrong side of the hyperplane are misclassified by the model, but tolerating them helps to improve the overall accuracy of the model. The classifier maximizes the margin $M$ subject to the constraints,
$y_i(\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ... + \beta_px_{ip}) \geq M(1 - \epsilon_{i})$
where $\epsilon_i \geq 0$ and $\displaystyle \sum _{i=1}^{n} \epsilon_i \leq C$.
For any given observation, the slack variable $\epsilon_i$ tells us where the $i$-th observation is located relative to the hyperplane and the margin.
If the $i$-th observation is on the correct side of the margin, then $\epsilon_i = 0$. If
$\epsilon_i > 0$
then observation $i$ is on the wrong side of the margin. And if,
$\epsilon_i > 1$
then the observation is on the wrong side of the hyperplane.
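The slack values can be recovered from a fitted model. The sketch below uses synthetic data and assumes scikit-learn's rescaled formulation, in which the margin boundaries sit at a decision value of -1 and +1, so the slack of observation $i$ is $\max(0, 1 - y_i f(x_i))$.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian clouds (synthetic data, for illustration only).
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Signed score of each training point; the margin boundaries are at f = -1 and f = +1.
f = clf.decision_function(X)
slack = np.maximum(0.0, 1.0 - y * f)

on_correct_side = slack == 0                 # respects the margin
inside_margin = (slack > 0) & (slack <= 1)   # violates the margin, still correctly classified
misclassified = slack > 1                    # wrong side of the hyperplane

print(on_correct_side.sum(), inside_margin.sum(), misclassified.sum())
```

The three counts add up to the number of observations, since every point falls into exactly one of the three cases.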
Extending this to the tuning parameter $C$, which bounds the sum of the $\epsilon_i$, we can see that $C$ determines the number and severity of the violations of the margin and the hyperplane that we tolerate.
$C$ is treated as a tuning parameter and is generally chosen by cross-validation. $C$ also controls the bias-variance trade-off of the model.
If $C$ is small, we allow fewer observations to be on the wrong side, so the margin is narrow and the classifier fits the training data closely, which gives low bias but high variance. If $C$ is large, more violations are tolerated and the margin is wider, which gives higher bias but lower variance.
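Choosing the tuning parameter by cross-validation could look like the sketch below. Note one assumption that matters: scikit-learn's `C` is a penalty on violations rather than a budget for them, so it behaves inversely to the $C$ described here (a large scikit-learn `C` tolerates fewer violations). The grid of candidate values and the feature standardization step are illustrative choices, not something prescribed by this post.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Standardize the features, then fit a linear support vector classifier.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="linear"))

# Candidate values for scikit-learn's C (a penalty, not a violation budget).
param_grid = {"svc__C": [0.01, 0.1, 1, 10, 100]}

# 5-fold cross-validation over the grid; the default score is accuracy.
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print("best C:", search.best_params_["svc__C"])
print("mean cross-validated accuracy:", round(search.best_score_, 3))
```

Putting the scaler inside the pipeline means it is refitted on each training fold, so no information from the held-out fold leaks into the preprocessing.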
As with the maximal margin classifier, not all observations influence the position of the hyperplane. It depends only on the observations that lie on the margin or violate it, and these are the support vectors.
Extending these ideas a little further leads us to support vector machines. Let’s discuss them in some detail.
Support Vector Machines
In the support vector machine, we introduce another ingredient called the kernel, which lets us enlarge the feature space used by the support vector classifier in a specific way. Following our discussion of the support vector classifier, its decision function can be rewritten as,
$f(x) = \beta_0 +\displaystyle \sum_{i=1}^n \alpha_i \langle x, x_i \rangle$
where $\langle x, x_i \rangle$ is the inner product between the new point $x$ and the training point $x_i$.
We do not need the details of how the inner product is computed. Instead, we can replace every occurrence of the inner product with a more general function called the kernel.
$f(x) = \beta_0 +\displaystyle \sum_{i \in S} \alpha_iK(x, x_i)$
where $S$ is the set of indices of the support vectors, as only the support vectors are responsible for the position of the hyperplane.
With $p$ features, one choice of kernel is,
$K(x_i, x_{i'}) = (1 +\displaystyle \sum_{j=1}^p x_{ij}x_{i'j} )^d$
which is known as a polynomial kernel of degree $d$. This type of kernel leads to a much more flexible decision boundary.
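As a quick check of the polynomial kernel formula, the sketch below computes the kernel matrix entry by entry and compares it with scikit-learn's `polynomial_kernel`. The data is random and the `poly_kernel` helper is a hypothetical name used only for this illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

def poly_kernel(x_i, x_k, d):
    """Polynomial kernel of degree d: (1 + sum_j x_ij * x_kj) ** d."""
    return (1.0 + np.dot(x_i, x_k)) ** d

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # 5 made-up observations with p = 3 features
d = 2

# Kernel matrix computed entry by entry from the formula.
K_manual = np.array([[poly_kernel(a, b, d) for b in X] for a in X])

# Same matrix from scikit-learn; gamma=1 and coef0=1 match the formula above.
K_sklearn = polynomial_kernel(X, X, degree=d, gamma=1, coef0=1)

print(np.allclose(K_manual, K_sklearn))
```

With `SVC`, the corresponding settings would be `kernel="poly"` together with `degree`, `gamma` and `coef0`; the default `gamma` there is not 1, so it has to be set explicitly to match this formula.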
Support Vector Machine for more than two classes
In our discussion of support vector machines, we haven’t yet talked about the case where there are more than two possible classes. We can handle these problems by extending the simple SVM in two ways.
- One versus One Classification
- One versus All Classification
In one versus all classification, we fit $K$ SVMs, each comparing one of the $K$ classes to the remaining $K - 1$ classes. Finally, we assign any incoming test vector $x^*$ to the class $k$ for which the following quantity is largest.
$\beta_{0k} +\beta_{1k}x_1^* +\beta_{2k}x_2^* +...+\beta_{pk}x_p^*$
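The one versus all rule can be sketched with scikit-learn's `OneVsRestClassifier` wrapped around a linear `SVC`. The iris data is used here only because it has three classes; it is not the dataset used elsewhere in this post.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # K = 3 classes

# One binary linear SVM per class: class k versus the remaining K - 1 classes.
ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

# One column of scores per class; each row is one observation.
scores = ova.decision_function(X)

# Assign each observation to the class with the largest score.
manual = ova.classes_[np.argmax(scores, axis=1)]

print(scores.shape)
print(np.array_equal(manual, ova.predict(X)))
```

Picking the column with the largest score by hand reproduces what `predict` does for this wrapper.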
Support Vector Machine using Sklearn
Now we will train a support vector machine model using sklearn. We are going to work with the breast cancer dataset that ships with sklearn, which we have used in other posts as well. The data contains measurements taken from cancer patients, such as the perimeter and radius, along with many other features. Our goal is to predict whether the cancer is WDBC-Malignant or WDBC-Benign.
Let’s import everything first.
# Libraries used in this walkthrough
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# Load the breast cancer data and put the features in a DataFrame
cancer = load_breast_cancer()
df = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
print(df.head())

# Description of the dataset and the target labels (0 = malignant, 1 = benign)
print(cancer['DESCR'])
print(cancer['target'])

# Hold out 40% of the observations for testing
X_train, X_test, y_train, y_test = train_test_split(df, cancer['target'], test_size=0.40)

# Fit a support vector classifier with the default settings
model = SVC()
model.fit(X_train, y_train)

# Evaluate the model on the held-out test set
predictions = model.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
About Author
Ranvir Singh
Greetings! Ranvir is an Engineering professional with 3+ years of experience in Software development.