What is Gini Impurity?

Hıdır Yeşiltepe
4 min read · Mar 30, 2021


Gini impurity is a statistical measure used in decision trees to form the tree structure. While forming the tree, the algorithm (CART, ID3, etc.) must decide which feature should be selected first. So in this post, we will take a close look at the main idea behind this selection.

Why do we need this statistical measure?

The CART algorithm (Classification and Regression Trees) is a greedy algorithm, which means that it makes decisions with respect to the current state only. It greedily searches for the “best” (optimal) split at the current level and repeats the same process at each subsequent level. There is no guarantee that decisions made earlier will lead to optimal boundaries for the entire dataset. The question, then, is: how does the CART algorithm approach the splitting problem so that at the end of the process we have a reasonably good solution?

To find the optimal split at the current level of the tree, we need a measure that distinguishes a reasonably optimal split from the non-optimal ones. Two criteria are commonly used for this purpose:

  1. Gini Impurity (Gini Index)
  2. Entropy

Definition of the Gini Impurity

The Gini impurity of a node is the likelihood that a single sample taken from that node would be misclassified if it were given a random class label drawn according to the class distribution of the node. In other words, this measure is all about the misclassification rate of the data with respect to all the samples in the given node (we will come back to this node concept later).

More formally, let D be a dataset containing samples from k classes, and let pᵢ denote the probability that a sample at a given node belongs to class i. Then the Gini impurity is calculated as:
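Gini(D) = 1 − Σᵢ₌₁ᵏ pᵢ²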

Now, let us look at what the formula produces in practice. From the table below, we can infer that a uniform class distribution leads to the greatest impurity score, whereas the minimum impurity score is obtained when the entire distribution belongs to a single class.

(Table source: https://www.learndatasci.com/glossary/gini-impurity/)
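To make the table’s pattern concrete, here is a minimal Python sketch (the function name gini_from_probs is my own) that simply evaluates the formula for a few distributions:

```python
def gini_from_probs(probs):
    """Gini impurity from class probabilities: 1 - sum of p_i squared."""
    return 1.0 - sum(p * p for p in probs)

# All samples in one class -> minimum impurity
print(gini_from_probs([1.0, 0.0]))   # 0.0
# Uniform over two classes -> maximum impurity for k = 2
print(gini_from_probs([0.5, 0.5]))   # 0.5
# Uniform over four classes -> 1 - 1/4
print(gini_from_probs([0.25] * 4))   # 0.75
```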

Feature Selection with Gini Impurity

Let D be the dataset. If D is split into two subsets D₁ and D₂, of sizes n₁ and n₂, with respect to an attribute A, the Gini impurity associated with this attribute (feature) is defined as:
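Giniₐ(D) = (n₁/n) · Gini(D₁) + (n₂/n) · Gini(D₂)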

where n is the total number of samples in D. This is the weighted sum of the Gini impurities of the subsets, where the weights are the ratios of the subset sizes to the total size. While training a decision tree, CART selects the feature with the lowest weighted Gini impurity.
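As a quick sketch (the helper names gini and weighted_gini are my own), the weighted sum can be computed directly from the two subsets’ class labels:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum of p_i squared."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(d1_labels, d2_labels):
    """Weighted Gini impurity of a split of D into D1 and D2."""
    n1, n2 = len(d1_labels), len(d2_labels)
    n = n1 + n2
    return (n1 / n) * gini(d1_labels) + (n2 / n) * gini(d2_labels)

# A split that isolates one class scores lower (better):
print(weighted_gini(["a", "a", "a"], ["b", "b", "a"]))  # ~0.222
print(weighted_gini(["a", "b", "a"], ["b", "a", "b"]))  # ~0.444
```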

Interpretation of the Value

Gini impurity takes values between 0 and 1. If the value is close to 0, the misclassification probability at that node is low; in other words, a sample at that node is likely to be classified correctly. The interpretation of larger values is just the opposite. Note that 0 corresponds to a pure node, while the maximum for k classes is 1 − 1/k (0.5 in the binary case), which approaches 1 only as the number of classes grows.

How Is It Used in the CART Algorithm?

If the current node is not a leaf node, the CART algorithm must split its data optimally. For this split, it uses the feature with the minimum weighted Gini impurity. But how does the CART algorithm decide which boundary value will lead to the optimal split at the current node? Let’s see this in the Iris dataset.

Decision Tree of Iris Dataset with max_depth = 2
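A tree like the one in the figure can be reproduced with scikit-learn, whose DecisionTreeClassifier uses Gini impurity as its default splitting criterion. A minimal sketch, assuming a recent scikit-learn version:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42)  # criterion="gini" by default
tree.fit(iris.data, iris.target)

# Print the learned split features and threshold values
print(export_text(tree, feature_names=iris.feature_names))
```

Note that on Iris the root split is a tie: petal width ≤ 0.8 and petal length ≤ 2.45 both separate the setosa class perfectly and give the same cost, so which one appears at the root depends on how scikit-learn breaks the tie.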

For example, for the node at depth 0, where did the 0.8 boundary come from? How did the CART algorithm decide that 0.8 for the petal width feature would lead to the optimal split?

We have just learned that feature selection, the decision of which feature to split on first, is done by the following formula:
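Giniₐ(D) = (n₁/n) · Gini(D₁) + (n₂/n) · Gini(D₂)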

This formula is the cost function of the CART algorithm, which tries to minimize its value. Let tₐ be a boundary for the attribute A. Different boundaries lead to different weighted impurities, and out of all candidate boundaries, CART will choose the one with the minimum cost, as the sketch below demonstrates.
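Here is a minimal sketch (the helper names gini and best_boundary are my own) of that exhaustive search for a single feature, using the petal width column of the Iris dataset:

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris

def gini(labels):
    """Gini impurity: 1 - sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_boundary(values, labels):
    """Try every midpoint between consecutive unique feature values and
    return the boundary t_a with the minimum weighted Gini impurity."""
    uniq = np.unique(values)
    candidates = (uniq[:-1] + uniq[1:]) / 2
    n = len(labels)
    best_t, best_cost = None, float("inf")
    for t in candidates:
        left, right = labels[values <= t], labels[values > t]
        cost = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

iris = load_iris()
petal_width = iris.data[:, 3]  # the "petal width (cm)" column
t, cost = best_boundary(petal_width, iris.target)
print(t, cost)  # 0.8, ~0.333: the boundary from the figure
```

The search lands on 0.8 because it is the midpoint between the largest petal width among the setosa samples (0.6) and the smallest among the remaining samples (1.0).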

Thanks for reading this article! Leave a comment if you have any questions.
