Random Oversampling
Random oversampling duplicates randomly selected data points from the minority class until it matches the size of the majority class. The technique is closely related to bootstrapping: bootstrapping resamples with replacement from all classes, while random oversampling resamples with replacement from the minority class only. Thus, random oversampling can be seen as a specialized form of bootstrapping.
Despite its simplicity, random oversampling has a notable limitation: because it only adds exact duplicates, it can lead to overfitting. On the other hand, it has several advantages: it is easy to implement, makes no assumptions about the data, and runs quickly because the algorithm is so simple.
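To make the mechanics concrete, here is a minimal NumPy sketch for the binary case; the function name `random_oversample` and the fixed seed are illustrative, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def random_oversample(X, y, minority_label):
    """Duplicate randomly chosen minority points until the classes are balanced."""
    minority_idx = np.where(y == minority_label)[0]
    n_majority = len(y) - len(minority_idx)
    n_needed = n_majority - len(minority_idx)  # duplicates required to balance
    # Draw minority indices with replacement: a bootstrap restricted to one class.
    extra_idx = rng.choice(minority_idx, size=n_needed, replace=True)
    return np.vstack([X, X[extra_idx]]), np.concatenate([y, y[extra_idx]])
```

In practice, off-the-shelf implementations such as `RandomOverSampler` from the imbalanced-learn library serve the same purpose.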
SMOTE
The Synthetic Minority Oversampling Technique (SMOTE), proposed in 2002, synthesizes new data points from the existing points in the minority class. The process involves the following steps, sketched in code after the list:
- Finding the K nearest neighbors for all minority class data points (K is usually 5).
- For each minority class data point:
  - Selecting one of its K nearest neighbors.
  - Picking a random point on the line segment connecting these two points in the feature space to generate a new output sample (interpolation).
  - Repeating the selection and interpolation steps with different nearest neighbors, depending on the desired amount of upsampling.
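The steps above can be sketched in a few lines of Python. This is a simplified illustration using scikit-learn's `NearestNeighbors`, assuming `X_minority` holds only minority class samples; it is not the reference implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(seed=0)

def smote(X_minority, n_synthetic, k=5):
    """Create n_synthetic points by interpolating between minority samples."""
    # K nearest neighbors of every minority point (k + 1 because each
    # point is returned as its own nearest neighbor).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_minority))    # a random minority point
        j = rng.choice(neighbor_idx[i][1:])  # one of its K nearest neighbors
        gap = rng.random()                   # random position on the segment
        # Interpolate: a new point somewhere between x_i and x_j.
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.vstack(synthetic)
```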
SMOTE addresses the overfitting problem of random oversampling by adding new, previously unseen data points rather than duplicating existing ones. This makes SMOTE a preferred technique for many researchers. However, SMOTE’s generation of artificial data points can introduce extra noise, potentially making the classifier more unstable. Additionally, the synthetic points can cause overlaps between minority and majority classes that do not reflect reality, leading to over-generalization.
Borderline SMOTE
Borderline SMOTE is a popular extension of SMOTE designed to reduce the noise introduced by the artificial data points and to create ‘harder’ data points, i.e., points close to the decision boundary and therefore more challenging to classify. These harder data points are particularly beneficial for the model’s learning process.
Borderline SMOTE works by identifying minority class points whose nearest neighbors are mostly, but not exclusively, majority class points and grouping them into a DANGER set. These DANGER points are difficult to classify because they lie close to the decision boundary. Points whose nearest neighbors are exclusively majority class points are excluded, as these are considered noise. Once the DANGER set is established, the SMOTE algorithm is applied as usual to generate synthetic data points from this set.
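Here is a minimal sketch of the DANGER-set selection, again using scikit-learn's `NearestNeighbors`; the threshold follows the description above (more than half, but not all, of the m nearest neighbors belong to the majority class), and the helper name `danger_set` is illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def danger_set(X, y, minority_label, m=5):
    """Return indices of minority points that lie near the decision boundary."""
    minority_idx = np.where(y == minority_label)[0]
    # m + 1 neighbors over the full dataset; the first is the point itself.
    nn = NearestNeighbors(n_neighbors=m + 1).fit(X)
    _, neighbor_idx = nn.kneighbors(X[minority_idx])

    danger = []
    for point, idx in zip(minority_idx, neighbor_idx):
        n_majority = np.sum(y[idx[1:]] != minority_label)
        # DANGER: mostly, but not exclusively, majority class neighbors.
        # All-majority neighborhoods are treated as noise and skipped.
        if m / 2 <= n_majority < m:
            danger.append(point)
    return np.array(danger)
```

Synthetic points are then generated from the samples indexed by `danger_set(X, y, minority_label)` just as in plain SMOTE; imbalanced-learn ships a ready-made version of this method as `BorderlineSMOTE`.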