A simple mathematical guide to classification Trees using sklearn | Decision Trees
1. Classification Error Rate for Classification Trees
2. Entropy in Classification tree
3. Information Gain in classification trees
4. Gini Index in Classification Trees
5. Gini Gain in Classification Trees
6. Using Entropy and Information Gain to create Decision tree nodes
7. Using Gini Index and Gini Gain to create Decision tree nodes
8. Classification Trees using Sklearn
As we already know from our previous discussion on Regression Trees, that tree algorithms are Greedy in nature which means they tend to choose the better node now, rather than choosing a node that will create a better tree later.
Also, in contrast to the regression tree model where the predicted response of any given node is the mean of all the observations in the region, in classification trees predicted response is the most commonly occurring observation in the region to which it belongs.
Read the following post for more details.
2. Entropy in Classification tree
3. Information Gain in classification trees
4. Gini Index in Classification Trees
5. Gini Gain in Classification Trees
6. Using Entropy and Information Gain to create Decision tree nodes
7. Using Gini Index and Gini Gain to create Decision tree nodes
8. Classification Trees using Sklearn
A Beginner's guide to Regression Trees using Sklearn | Decision Trees
#python
#machinelearning
#datascience
#sklearn
April 3, 2020
10 mins read
- Classification Error rate
- Entropy
- Gini Index
Classification Error Rate for Classification Trees
Classification Error rate is simply the fraction of the training observations in that region that do not belong to the most common class. Mathematically,
$E = 1 - max(p(k))$
$Where\ p(k)\ is\ the\ proportion\ of\ training\ observations\ in\ the\ mth\ region\ that\ are\ from\ the\ kth\ class$
Classification error rate is not used generally because it is not sensitive for tree-growing, therefore, Entropy
or Gini index
is used instead.
Entropy in Classification tree
It’s the measure of amount of uncertainty in the data(Randomness). Higher the uncertainty, higher is the entropy. The value of entropy iszero
when there is no uncertainty in some event. For example, if we are tossing a coin having heads on both sides.
Mathematically, entropy is given by
$H(s) =\displaystyle \sum_{x \epsilon X} p(x) log_2 \frac{1}{p(x)}$
$where\ p(x)\ is\ the\ proportion\ of\ occurring\ of\ some\ event$
For example, for a simple coin toss, the probability is 1/2
.
Information Gain in classification trees
This is the value gained for a given setS
when some feature A
is selected as a node of the tree.
While selecting any node for the tree generation we want to maximize the Information Gain at that given point.
Information gain is given as the change in the Entropy before and after selecting any given feature as the node of the tree.
Mathematically, it is given as,
$IG(S, A) = H(S) - H(S, A)$
$IG(S, A) = H(S) - \displaystyle \sum_{i=0}^{n} P(x) * H(x)$
$where\ H(S)\ is\ the\ Entropy\ of\ entire\ Set$
$and\ \sum_{i=0}^{n} p(x) * H(x)\ is\ the\ Entropy\ after\ applying\ feature\ x\ where\ P(x)\ is\ the\ proportion\ of\ event\ x$
We will discuss it further while creating the model using Information Gain.
Information gain is the value of entropy that we removed after adding a node to the tree.
Gini Index in Classification Trees
This is the default metric that the Sklearn Decision Tree classifier tends to increase. It is used to quantify the split made in the tree at any given moment of node selection. Mathematically, gini index is given by,
$G = \displaystyle \sum_{k=1}^{K} P(k)(1 - P(k))$
$Where\ P(k)\ is\ the\ proportion\ of\ training\ instances\ with\ class\ k$
Minimum value that the Gini index can have is 0.For example, A coin having heads on both sides will give Gini Index as 0.
$(1 * (1 - 1)) = 0$
Gini index also tells about the purity of node selection. If a node selected is very pure the value of Gini index will be less.
Gini Gain in Classification Trees
As we have information gain in the case of entropy, we have Gini Gain in case of the Gini index. It is the amount of Gini index we gained when a node is chosen for the decision tree. We will take an example to understand these terms in little more detail. Let’s consider the following data source,| Outlook | Temperature | Humidity | Wind | Played |
|----------|-------------|----------|--------|--------|
| Sunny | Hot | High | Weak | No |
| Sunny | Hot | High | Strong | No |
| Overcast | Hot | High | Weak | Yes |
| Rain | Mild | High | Weak | Yes |
| Rain | Cold | Normal | Weak | Yes |
| Rain | Cold | Normal | Strong | No |
| Overcast | Cold | Normal | Strong | Yes |
| Sunny | Mild | High | Weak | No |
| Sunny | Cold | Normal | Weak | Yes |
| Rain | Mild | Normal | Weak | Yes |
| Sunny | Mild | Normal | Strong | Yes |
| Overcast | Mild | High | Strong | Yes |
| Overcast | Hot | Normal | Weak | Yes |
| Rain | Mild | High | Strong | No |
Using Entropy and Information Gain to create Decision tree nodes
Calculate the overall entropy
Total yes cases = 9
Total No cases = 5
Total Cases = 14
$\frac{9}{14}\log _{2} \frac{14}{9} + \frac{5}{14}\log _{2} \frac{14}{5}$
$= 0.940$
More the value of entropy inclined toward 1, more is the randomness in the data.As we have already evaluated the value of total entropy, let’s calculate the information gain while choosing each and every feature separately. Let’s start with the
wind
feature.
$IG(S, Wind) = H(S) - \sum _{i=0}^{n} P(x) * H(x)$
Total Weak wind cases = 8
Total Strong wind cases = 6
Total Cases = 14
Entropy for Weak wind
$H(S_{weak}) = \frac{6}{8} \log_{2}\frac{8}{6} + \frac{2}{8} \log_{2}\frac{8}{2}$
$= 0.811$
Entropy for Strong wind
$H(S_{strong}) = \frac{3}{6} \log_{2}\frac{6}{3} + \frac{3}{6} \log_{2}\frac{6}{3}$
$= 1.00$
Total Information Gain can be calculated as follows,
$IG(S, Wind) = H(S) - P(S_{weak}) * H(S_{weak}) - P(S_{strong}) * H(S_{strong})$
$= 0.940 - \frac {8}{14} (0.811) - \frac{6}{14}(1.00)$
$= 0.048$
IG
for other features as well and select the one which produces the highest value of IG
.
We will continue this process until leaf nodes are reached for every branch created.
Using Gini Index and Gini Gain to create Decision tree nodes
Calculate the overall Gini impurity
Total yes cases = 9
Total No cases = 5
Total Cases = 14
$GI(S) = \frac{9}{14}(1 - \frac{9}{14}) + \frac{5}{14}(1 - \frac{5}{14})$
$= 0.46$
As we have already evaluated the value of total Gini impurity, let’s calculate the Gini gain while choosing each and every feature separately.
Let’s start with the wind
feature.
Total Weak wind cases = 8
Total Strong wind cases = 6
Total Cases = 14
Gini Index for Weak wind
$GI(S_{weak}) = \frac{6}{8}(1 - \frac{6}{8}) + \frac{2}{8}(1 - \frac{2}{8})$
$= \frac{3}{8}$
Gini Index for Strong wind
$GI(S_{Strong}) = \frac{3}{6}(1 - \frac{3}{6}) + \frac{3}{6}(1 - \frac{3}{6})$
$= \frac{1}{2}$
Total Gini Gain can be calculated as follows,
$GG(S_{wind}) = GI(S) - \frac{8}{14} * GI(S_{weak}) - \frac{6}{14} * GI(S_{strong})$
$GG(S_{wind}) = 0.46 - 0.214 - 0.214$
$= 0.0314$
GG
for other features as well and select the one which produces the highest value of GG
.
This is the basic understanding of Classification Trees.
Classification Trees using Sklearn
We will be using sklearn to train a model of breast cancer data.Import everything
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
%matplotlib inline
Load data and split it into training and testing data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
Initialize the model with default criterion (Gini Index)
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
Predict the model and check the Reports
predictions = clf.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
Using Entropy as the criterion for predicting classification Model
clf_entropy = DecisionTreeClassifier(criterion='entropy')
clf_entropy.fit(X_train, y_train)
predictions_entropy = clf_entropy.predict(X_test)
print(confusion_matrix(y_test, predictions_entropy))
print(classification_report(y_test, predictions_entropy))
About Author
Ranvir Singh
Greetings! Ranvir is an Engineering professional with 3+ years of experience in Software development.
Original Source: Original Post
Please share your Feedback:
Did you enjoy reading or think it can be improved? Don’t forget to leave your thoughts in the comments section below! If you liked this article, please share it with your friends, and read a few more!