Class Imbalance: Removing bias from sample data

Jaishri Rai
Published in GoPenAI
May 24, 2023

Class imbalance refers to a situation in a classification problem where the distribution of classes in the target variable is significantly skewed or uneven. It occurs when the number of observations belonging to one class is much higher or lower than the number of observations in other classes.

Let’s say we are conducting an online survey in Maharashtra about people’s choices for the most suitable political party in power. Since digital penetration is higher in urban areas than in rural ones, the data collected will be biased toward the urban population. Furthermore, in rural areas there will be far fewer female participants than male participants. Rural women will therefore form a minority group in the sample, and any understanding of this section’s opinions will rest on a much smaller subset of the data.

Class imbalance is a common issue in various real-world scenarios, including fraud detection, disease diagnosis, rare event prediction, and anomaly detection. It poses challenges for machine learning algorithms because they tend to be biased towards the majority class, leading to poor performance on the minority class. The minority class may receive less attention during the training process, resulting in a model that achieves high accuracy on the majority class but performs poorly on the minority class.

How can class imbalance be rectified?

  1. Resampling: This involves either oversampling the minority class (creating synthetic instances of the minority class) or undersampling the majority class (removing instances from the majority class). The goal is to rebalance the class distribution and provide a more equal representation of the classes during training (a minimal Python sketch follows this list).
  2. Class Weighting: Assigning higher weights to the minority class during model training can give it more importance and help alleviate the bias towards the majority class. This way, misclassifications in the minority class have a larger impact on the overall model performance.
  3. Algorithmic Approaches: Some machine learning algorithms have built-in mechanisms to handle class imbalance, such as decision tree algorithms with balancing techniques (e.g., random forests) or algorithms that use class-specific costs or penalties for misclassification.
  4. Anomaly Detection: In cases where the minority class represents rare events or anomalies, specialized anomaly detection algorithms can be employed to identify and handle these instances separately from the majority class.
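
To make the resampling idea concrete, here is a minimal sketch of random oversampling of the minority class in Python. The toy dataset and its 95/5 class split are invented purely for illustration; libraries such as imbalanced-learn also provide ready-made resamplers (for example, SMOTE for synthetic oversampling).

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(42)

# Toy imbalanced dataset: 950 majority-class rows, 50 minority-class rows.
X = rng.normal(size=(1000, 4))
y = np.array([0] * 950 + [1] * 50)

X_maj, y_maj = X[y == 0], y[y == 0]
X_min, y_min = X[y == 1], y[y == 1]

# Randomly oversample the minority class (sampling with replacement)
# until it matches the size of the majority class.
X_min_up, y_min_up = resample(
    X_min, y_min, replace=True, n_samples=len(y_maj), random_state=42
)

X_balanced = np.vstack([X_maj, X_min_up])
y_balanced = np.concatenate([y_maj, y_min_up])

print(np.bincount(y_balanced))  # [950 950] -> both classes equally represented
```

Undersampling works the same way in reverse: keep only as many randomly chosen majority-class rows as there are minority-class rows, at the cost of discarding data.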

In the first method, resampling, we work directly on the raw sample dataset: the modifications happen in the raw data itself. In the second method, class weighting, we handle misclassification by assigning appropriate weights to each class (see the sketch just below). The third method, the algorithmic approach, is interesting. Let’s briefly dive into this particular method.
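
Before diving into the algorithmic approaches, here is a minimal sketch of class weighting using scikit-learn’s class_weight parameter. The synthetic dataset and the 1:10 weight ratio are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy imbalanced dataset: roughly 95% of samples in class 0, 5% in class 1.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.95, 0.05], random_state=42,
)

# class_weight='balanced' reweights each class inversely to its frequency,
# so mistakes on the rare class cost more during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# An explicit mapping is also possible, e.g. penalising minority-class
# (label 1) errors ten times as heavily as majority-class errors.
clf_custom = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000).fit(X, y)
```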

Studying Algorithmic Approaches for Rectifying Class Imbalance

Decision tree algorithms, such as Random Forest, can employ balancing techniques to address class imbalance and incorporate class-specific costs or penalties during the model training process. Here’s an explanation of how decision tree algorithms can use these techniques:

  1. Bagging and Random Forest (Stratified Sampling): Decision tree algorithms like Random Forest use an ensemble learning technique called bagging, where multiple decision trees are trained on different subsets of the training data. Each tree in the forest is trained independently, and the final prediction is determined by aggregating the predictions of all the individual trees.

To handle class imbalance, bagging can be enhanced by modifying the sampling process. Instead of randomly sampling from the entire training set, the sampling can be stratified to ensure a more balanced representation of the classes in each subset. This means that each subset used to train an individual decision tree will have a similar proportion of instances from each class.

By using stratified sampling, the decision trees in the Random Forest will have exposure to both the majority and minority classes during training. This can help improve the model’s ability to capture patterns and make accurate predictions for both classes.
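
As one hedged illustration, scikit-learn’s RandomForestClassifier offers class_weight='balanced_subsample', which recomputes class weights inside each bootstrap sample rather than performing true stratified sampling; it is a built-in way to approximate the per-tree rebalancing described above. The dataset below is synthetic and purely illustrative (imbalanced-learn’s BalancedRandomForestClassifier, which resamples each bootstrap, is another option).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic imbalanced dataset (95/5 split), invented for illustration.
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.95, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

# 'balanced_subsample' recomputes class weights within every bootstrap sample,
# so each individual tree gives extra weight to the minority class it sees.
forest = RandomForestClassifier(
    n_estimators=200, class_weight="balanced_subsample", random_state=0
)
forest.fit(X_train, y_train)

print(classification_report(y_test, forest.predict(X_test)))
```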

  2. Cost-Sensitive Learning: Decision tree algorithms can also incorporate class-specific costs or penalties during the training process. This approach, known as cost-sensitive learning, assigns different costs to different types of misclassifications based on the importance of each class.

For example, in a fraud detection problem with imbalanced classes, misclassifying a fraudulent transaction as non-fraudulent (a false negative) may have severe consequences, while misclassifying a non-fraudulent transaction as fraudulent (a false positive) may have less impact. In this case, a higher cost or penalty can be assigned to the false negative errors compared to the false positive errors.
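
A minimal sketch of cost-sensitive learning with a single decision tree is shown below, again on a synthetic “fraud-like” dataset; the 1:20 cost ratio is an assumption chosen only to illustrate penalising false negatives more heavily than false positives.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Synthetic "fraud" dataset: label 1 (fraud) is the rare class.
X, y = make_classification(
    n_samples=5000, n_features=8, n_informative=5, n_redundant=0,
    weights=[0.97, 0.03], random_state=7,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=7
)

# Charging 20x more for errors on class 1 makes a missed fraud case
# (false negative) far more expensive than a false alarm (false positive).
tree = DecisionTreeClassifier(class_weight={0: 1, 1: 20}, random_state=7)
tree.fit(X_train, y_train)

print(confusion_matrix(y_test, tree.predict(X_test)))
```

In practice, the cost ratio would come from domain knowledge, such as the monetary cost of a missed fraud case versus the cost of manually reviewing a false alarm.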
