When training a Machine Learning (ML) model such as a binary classifier, we might get an accuracy of 0.99 on the validation set. This might sound amazing, but what if the data consists of insurance claims, 99% of which are not fraudulent and only 1% of which are? Did the model really learn anything? If we classify everything as not fraudulent, we still get 99% accuracy, yet the model has mislabeled every fraudulent claim as non-fraudulent.
So how do we handle this imbalance? As with any data science or ML problem, it depends on the data and the problem we are trying to solve. For the insurance claims example, we have a variety of options. Let's discuss them.
Methods to Handle Imbalanced Data
Bootstrap sampling on the underrepresented class: This increases the number of data points available for the model. Apply some noise to the resampled points to avoid overfitting, so the model doesn't simply memorize the training data and then perform poorly on the testing data. One way I have added noise is by generating a random number between 0.95 and 1.05 and multiplying the data point in question by it, so the resampled value stays within 5% of the original. Consult a subject matter expert when doing this to make sure the noise doesn't distort the data and hurt model performance; the range of the random number should also be discussed with them, since you might instead need a value that is 2% or even 0.5% off. A sketch of this resampling appears after this list.
Undersample the larger class: Some data points might be redundant, so removing them can help the model, and this is most useful when the training set is very large. Again, consultation with the subject matter expert is important. Whether it is feasible also depends on the business problem: in time-series forecasting, for example, removing individual data points may not make sense, but removing an entire sequence might.
Stratify the train-test split: I personally consider this the truth test of whether a model has learned to distinguish between the two classes. When doing a train-test split, stratify on the target variable so that the ratio of the two classes is the same in the training and testing sets.
Weight the two classes: Assign a higher weight to the minority class so that misclassifying its examples costs more during training, which counteracts the imbalance. A class-weighting sketch also follows this list.
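Here is a minimal sketch of the bootstrap-plus-noise idea (with optional undersampling of the majority class), assuming a pandas DataFrame of claims with a binary is_fraud column; all column names and sizes here are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical claims data: numeric features plus a binary "is_fraud" target.
df = pd.DataFrame({
    "claim_amount": rng.gamma(2.0, 1500.0, size=10_000),
    "customer_age": rng.integers(18, 80, size=10_000),
    "is_fraud": (rng.random(10_000) < 0.01).astype(int),
})

minority = df[df["is_fraud"] == 1]
majority = df[df["is_fraud"] == 0]

# Bootstrap-sample (with replacement) the minority class up to a chosen size.
boot = minority.sample(n=len(minority) * 20, replace=True, random_state=42).copy()

# Jitter continuous features by a factor in [0.95, 1.05] so each resampled
# row stays within 5% of its original rather than being an exact copy.
boot["claim_amount"] *= rng.uniform(0.95, 1.05, size=len(boot))

# Optionally undersample the majority class as well.
majority_down = majority.sample(frac=0.5, random_state=42)

balanced = pd.concat([majority_down, minority, boot], ignore_index=True)
print(balanced["is_fraud"].value_counts(normalize=True))
```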
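And a short class-weighting sketch, again with hypothetical data; scikit-learn's class_weight="balanced" (or an explicit weight mapping) is one common way to make minority-class mistakes cost more during training.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 4))              # hypothetical features
y = (rng.random(5_000) < 0.01).astype(int)   # roughly 1% positive class

# See what "balanced" weighting works out to for this class ratio.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))

# Pass either the string "balanced" or an explicit {class: weight} mapping.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)
```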
Model Evaluation
In a real-world model, which strategy to use depends a lot on the business problem. Using our insurance fraud example, a good solution might be to stratify on the target variable during the train-test split. This way the ratio of the two classes in the training set will be identical to that in the testing set, giving an honest signal of whether you did a good job engineering features and whether the model can actually distinguish between the two classes.
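A minimal sketch of a stratified split with scikit-learn, using hypothetical arrays just to show the mechanics:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 4))              # hypothetical features
y = (rng.random(10_000) < 0.01).astype(int)   # roughly 1% positive class

# stratify=y keeps the class ratio (nearly) identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("train positive rate:", y_train.mean())
print("test positive rate:", y_test.mean())
```

However you split or resample, plain accuracy will still hide how the minority class is treated. The metrics below are better suited to judging the result, and a short sketch computing them follows the list.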
Precision: Measures how many of the predicted positive cases are actually positive. High precision means fewer false positives.
Recall (Sensitivity or True Positive Rate): Measures how many actual positive cases the model detects. High recall means fewer false negatives.
F1 Score: The harmonic mean of precision and recall. It balances the trade-off between precision and recall, useful when you need a single metric to evaluate minority class predictions.
Specificity (True Negative Rate): Measures how many actual negative cases the model correctly identifies, useful to understand performance on the majority class.
Area Under the Precision-Recall Curve (AUPRC): Focuses on the minority class performance and is more informative than ROC AUC for severely imbalanced datasets.
Matthews Correlation Coefficient (MCC): A balanced measure that takes all four confusion matrix categories into account and works well even if classes are of very different sizes.
Balanced Accuracy: The average of sensitivity and specificity, which accounts for imbalanced class distributions.
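As a quick illustration, here is how these metrics could be computed with scikit-learn. The true labels, predicted labels, and predicted scores below are made up purely so the snippet runs; in practice they would come from your fitted model.

```python
import numpy as np
from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    average_precision_score,
    matthews_corrcoef,
    balanced_accuracy_score,
)

# Hypothetical labels and scores, imbalanced roughly 99:1.
rng = np.random.default_rng(2)
y_true = (rng.random(2_000) < 0.01).astype(int)
y_score = np.clip(0.05 * y_true + 0.1 * rng.random(2_000), 0.0, 1.0)
y_pred = (y_score >= 0.08).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("precision:        ", precision_score(y_true, y_pred, zero_division=0))
print("recall:           ", recall_score(y_true, y_pred))
print("F1:               ", f1_score(y_true, y_pred))
print("specificity:      ", tn / (tn + fp))
print("AUPRC:            ", average_precision_score(y_true, y_score))
print("MCC:              ", matthews_corrcoef(y_true, y_pred))
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
```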