Evaluation of our synthetic data by
SAS data experts

Our synthetic data is assessed and
approved by the data experts of SAS


Conclusions by the data
experts of SAS

Syntho’s synthetic data has been rigorously evaluated and approved by SAS data experts, affirming its accuracy and usability.

Synthetic vs. Original Performance
Models trained on synthetic data perform highly similarly to models trained on the original data
Anonymized Data Performance Gap
Models trained on data anonymized with ‘classic anonymization techniques’ perform worse than models trained on the original data or on synthetic data
Fast Synthetic Data Generation
Synthetic data generation is fast and easy because the technique works the same way for every dataset and data type

Initial results of the data assessment by SAS

Models trained on synthetic data score highly similarly to models trained on original data

The AI algorithm learns patterns and relationships from real-world data to generate new, synthetic data that mimics these characteristics closely. This synthetic data is so accurate that it can be used for advanced analytics, acting as a “synthetic data twin” that functions like real-world data.

Why do models trained on anonymized data score worse?

Classic anonymization techniques all manipulate the original data in order to hinder tracing back individuals, and in doing so they destroy information. The more you anonymize, the better your data is protected, but the more of your data is destroyed.

This is especially damaging for AI and modeling tasks where “predictive power” is essential, because poor-quality data results in poor insights from the AI model. SAS demonstrated this in the assessment: models trained on anonymized data achieved an area under the curve (AUC) close to 0.5, no better than random guessing and by far the worst performance of all the models.


Our synthetic data is approved by the data experts of SAS


What did SAS do during this assessment?

Synthetic data generated by Syntho is assessed, validated and approved from an external and objective point of view by the data experts of SAS.

01
Telecom Data as Target

We used telecom data for “churn” prediction, focusing on how synthetic data could be utilized to train models and assess their performance.

02
Model Selection

SAS selected popular classification models for the prediction:

  • Random forest
  • Gradient boosting
  • Logistic regression
  • Neural network
03
Data Splitting

Before generating synthetic data, the telecom dataset was randomly split into:

  • Train Set: Used for training the models.
  • Holdout Set: Used for unbiased model scoring.
04
Generating Synthetic and Anonymized Data

Syntho generated a synthetic dataset using the train set. Additionally, SAS created an anonymized dataset using the same data, resulting in four datasets:

  • Original Train Dataset
  • Holdout Dataset
  • Anonymized Dataset
  • Synthetic Dataset
05
Model Training

Each dataset (original, anonymized, and synthetic) was used to train the churn prediction models. This resulted in a total of 12 trained models (3 datasets x 4 models). The models were trained using their respective datasets to evaluate how well they could predict churn outcomes. After training, the models’ accuracy was assessed using the holdout dataset to ensure unbiased performance evaluation across all models and datasets.
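
The exact setup used by SAS is not reproduced here, but a minimal sketch of this kind of train-and-score loop in Python with scikit-learn could look as follows. The generated toy data, the placeholder datasets, and all parameter choices are illustrative assumptions, not the actual assessment code.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy stand-in data; in the assessment described above these would be the real
# original, anonymized, and synthetic train sets plus the holdout set.
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, stratify=y, random_state=0)
datasets = {
    "original": (X_train, y_train),
    "anonymized": (X_train, y_train),  # placeholder for the anonymized train set
    "synthetic": (X_train, y_train),   # placeholder for the synthetic train set
}

models = {
    "random forest": RandomForestClassifier(random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
    "neural network": MLPClassifier(max_iter=500, random_state=0),
}

# 3 training datasets x 4 model types = 12 trained models, all scored on the same holdout set
for data_name, (X_tr, y_tr) in datasets.items():
    for model_name, model in models.items():
        fitted = clone(model).fit(X_tr, y_tr)
        auc = roc_auc_score(y_holdout, fitted.predict_proba(X_holdout)[:, 1])
        print(f"{data_name:>10} | {model_name:<20} AUC = {auc:.3f}")
```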

06
Model Performance Evaluation

SAS evaluated the accuracy of each model using the holdout dataset, measuring its predictive performance on customer churn. They also conducted detailed evaluations of data accuracy, privacy protection, and usability, concluding that, compared to the original data, Syntho’s synthetic data was accurate, secure, and usable.

Additional results of synthetic data assessments by SAS

Synthetic data generated by Syntho is assessed, validated and approved from an external and objective point of view by the data experts of SAS.

Correlations

The correlations and relationships between variables were accurately preserved in synthetic data.


Reference articles

Assessment by the data experts of SAS
Syntho winner of the SAS global hackathon
Healthcare case study results

Frequently Asked Questions

What is upsampling?

Upsampling increases the number of data samples in a dataset, aiming to correct imbalanced data and improve model performance. Also known as oversampling, this technique addresses class imbalance by adding data from minority classes until all classes are equal in size. Both Python’s scikit-learn and MATLAB offer built-in functions for implementing upsampling techniques.
It’s important to note that upsampling in data science is often mistaken for upsampling in digital signal processing (DSP). While both processes involve creating more samples, they differ in execution. In DSP, upsampling increases the sampling rate of a discrete-time signal: zeros are inserted between the original samples and a low-pass filter is then used to interpolate the new values, which is quite different from upsampling for data balancing.
Similarly, upsampling in data balancing is distinct from upsampling in image processing. In image processing, high-resolution images are first reduced in resolution (by removing pixels) for faster computations, and then convolution is used to return the image to its original dimensions (by adding back pixels).
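
As a small, hedged illustration of the scikit-learn route, the sketch below upsamples a minority class with `sklearn.utils.resample`; the toy cat/dog table is an assumption made up for the example.

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced dataset: 90 "cat" rows vs. 10 "dog" rows (illustrative only)
df = pd.DataFrame({
    "feature": range(100),
    "label": ["cat"] * 90 + ["dog"] * 10,
})

majority = df[df["label"] == "cat"]
minority = df[df["label"] == "dog"]

# Sample the minority class with replacement until both classes are equal in size
minority_upsampled = resample(
    minority,
    replace=True,              # duplicate existing rows
    n_samples=len(majority),   # match the majority class count
    random_state=42,           # for reproducibility
)

balanced = pd.concat([majority, minority_upsampled])
print(balanced["label"].value_counts())  # cat: 90, dog: 90
```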

Why use upsampling?

Upsampling is an effective method to address imbalances within a dataset. An imbalanced dataset occurs when one class is significantly underrepresented relative to the true population, creating unintended bias. For example, consider a model trained to classify images as either cats or dogs. If the dataset comprises 90% cats and 10% dogs, cats are overrepresented. A classifier that predicts “cat” for every image would achieve 90% overall accuracy, correctly labeling every cat but misclassifying every dog. This imbalance causes classifiers to favor the majority class’s accuracy at the minority class’s expense. The same issue can arise in multi-class datasets.

Upsampling mitigates this problem by increasing the number of samples for the underrepresented minority class. It synthesizes new data points based on the characteristics of the original minority class, balancing the dataset by ensuring an equal ratio of samples across all classes.

While plotting the counts of data points in each class can reveal imbalances, it doesn’t indicate the extent of their impact on the model. Performance metrics are essential for evaluating how well upsampling corrects class imbalance. These metrics are often used in binary classification, where one class (usually the positive class) is the minority and the other (the negative class) is the majority. Two popular metrics for assessing performance are Receiver Operating Characteristic (ROC) curves and precision-recall curves.
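
As a hedged sketch, the snippet below shows how those two summary numbers, the area under the ROC curve and average precision (the usual summary of a precision-recall curve), can be computed with scikit-learn; the generated toy data and the model choice are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced toy problem with roughly 10% positives (illustrative assumption)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # predicted probability of the minority class

print("ROC AUC:          ", roc_auc_score(y_test, scores))
print("Average precision:", average_precision_score(y_test, scores))
```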

Advantages and disadvantages of upsampling

Advantages

  • No Information Loss: Unlike downsampling, which removes data points from the majority class, upsampling generates new data points, avoiding any information loss.
  • Increase Data at Low Costs: Upsampling is especially effective, and often the only way, to increase dataset size on demand in cases where data can only be acquired through observation. For instance, certain medical conditions are simply too rare to allow for more data to be collected.

Disadvantages

  • Overfitting: Because upsampling creates new data based on the existing minority class data, the classifier can be overfitted to the data. Upsampling assumes that the existing data adequately captures reality; if that is not the case, the classifier may not be able to generalize very well.
  • Data Noise: Upsampling can increase the amount of noise in the data, reducing the classifier’s reliability and performance.
  • Computational Complexity: By increasing the amount of data, training the classifier will be more computationally expensive, which can be an issue when using cloud computing.

Upsampling techniques

Random Oversampling

Random oversampling involves duplicating random data points in the minority class until it matches the size of the majority class. Though similar to bootstrapping, random oversampling differs in that bootstrapping resamples from all classes, while random oversampling focuses exclusively on the minority class. Thus, random oversampling can be seen as a specialized form of bootstrapping.

Despite its simplicity, random oversampling has limitations. It can lead to overfitting since it only adds duplicate data points. However, it has several advantages: it is easy to implement, does not require making assumptions about the data, and has low time complexity due to its straightforward algorithm.
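
In practice random oversampling is usually done with a library call; a minimal sketch using the imbalanced-learn package, on made-up toy data, might look like this.

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler

# Toy imbalanced data: 95 majority samples (class 0) and 5 minority samples (class 1)
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 95 + [1] * 5)

# Duplicate randomly chosen minority samples until both classes have 95 members
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X, y)

print(np.bincount(y_resampled))  # [95 95]
```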

 

SMOTE

The Synthetic Minority Oversampling Technique (SMOTE), proposed in 2002, synthesizes new data points from the existing points in the minority class. The process involves:

  1. Finding the K nearest neighbors for all minority class data points (K is usually 5).
  2. For each minority class data point:
    1. Selecting one of its K nearest neighbors.
    2. Picking a random point on the line segment connecting these two points in the feature space to generate a new output sample (interpolation).
    3. Repeating the selection and interpolation steps with different nearest neighbors, depending on the desired amount of upsampling.

SMOTE addresses the overfitting problem of random oversampling by adding new, previously unseen data points rather than duplicating existing ones. This makes SMOTE a preferred technique for many researchers. However, SMOTE’s generation of artificial data points can introduce extra noise, potentially making the classifier more unstable. Additionally, the synthetic points can cause overlaps between minority and majority classes that do not reflect reality, leading to over-generalization.
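
To make the interpolation step concrete, here is a simplified, illustrative sketch of the core idea (it draws random base points rather than looping over every minority point, and the helper name and toy data are assumptions); in practice a maintained implementation such as imbalanced-learn's `SMOTE` would normally be used instead.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by interpolating between
    existing minority points and their k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbors because each point is returned as its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbors = nn.kneighbors(X_minority)

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))      # pick a minority point at random
        j = rng.choice(neighbors[i][1:])       # pick one of its k nearest neighbors
        gap = rng.random()                     # random position on the connecting segment
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.asarray(synthetic)

# Toy 2-D minority class (illustrative only)
X_min = np.random.default_rng(1).normal(size=(20, 2))
print(smote_sample(X_min, n_new=40).shape)  # (40, 2)
```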

 

Borderline SMOTE

Borderline SMOTE is a popular extension of the SMOTE technique designed to reduce artificial dataset noise and create ‘harder’ data points—those close to the decision boundary and therefore more challenging to classify. These harder data points are particularly beneficial for the model’s learning process.

Borderline SMOTE works by identifying minority class points that are close to many majority class points and grouping them into a DANGER set. These DANGER points are difficult to classify due to their proximity to the decision boundary. The selection process excludes points whose nearest neighbors are exclusively majority class points, as these are considered noise. Once the DANGER set is established, the SMOTE algorithm is applied as usual to generate synthetic data points from this set.
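
If you work with the imbalanced-learn package, Borderline SMOTE is available as `BorderlineSMOTE`; a minimal usage sketch on made-up toy data might look like this.

```python
from collections import Counter

from imblearn.over_sampling import BorderlineSMOTE
from sklearn.datasets import make_classification

# Imbalanced toy data with roughly 10% minority samples (illustrative assumption)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# "borderline-1" oversamples only minority points near the decision boundary (the DANGER set)
X_res, y_res = BorderlineSMOTE(kind="borderline-1", random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```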

Comparison of current upsampling methods

1. Naive Oversampling:

  • Description: Involves randomly selecting certain samples from minority groups and duplicating them in the dataset. This helps achieve a more balanced distribution of data by increasing the representation of the minority class.
  • When to Use: Naive oversampling is relevant when a simple approach to balance the dataset is needed, especially when computational resources or complexity need to be kept low and the risk of overfitting is not a concern.

 

2. SMOTE (Synthetic Minority Over-sampling Technique) [1]:

  • Description: SMOTE generates synthetic samples for the minority class by first identifying the k nearest neighbors of each minority class sample. It then creates new synthetic samples along the line segments connecting these minority samples to their neighbors, thereby introducing new, plausible examples and balancing the dataset.
  • When to Use: SMOTE is more relevant when there is a need to enhance the minority class representation in a way that preserves the structure and characteristics of the data, especially in datasets with numerical features.
  • Variants:
    • SMOTE-NC: Used for datasets containing both numerical and categorical features.
    • SMOTEN: Used for datasets with categorical features only.

 

3. ADASYN (Adaptive Synthetic Sampling) [2]

  • Description: ADASYN uses a weighted distribution for different minority class examples according to their learning difficulty. It generates more synthetic data for minority class examples that are harder to learn compared to those that are easier to learn.
  • When to Use: ADASYN is more relevant when dealing with imbalanced datasets where certain minority class examples are more difficult to classify and require additional synthetic samples for better learning.
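
A minimal usage sketch with imbalanced-learn's `ADASYN` follows; the generated toy data is an illustrative assumption, and note that ADASYN does not necessarily produce an exactly balanced result.

```python
from collections import Counter

from imblearn.over_sampling import ADASYN
from sklearn.datasets import make_classification

# Imbalanced toy data with roughly 10% minority samples (illustrative assumption)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# ADASYN generates more synthetic samples for minority points that are harder to learn
X_res, y_res = ADASYN(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```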

 

4. Synthetic Data

  • Description: Synthetic data refers to artificially generated data that mimics the properties of real data. It can be used to supplement or replace real data for various purposes, including training machine learning models.
  • When to Use: Synthetic data is relevant when there are concerns about data privacy, when real data is scarce or expensive to obtain, or when creating balanced datasets for training machine learning models. It is also suitable for mitigating overfitting, addressing rare events, reducing bias, and complying with regulatory requirements.

 

References:

[1] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.

[2] H. He, Y. Bai, E. A. Garcia, and S. Li, “ADASYN: Adaptive synthetic sampling approach for imbalanced learning,” in IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1322–1328, 2008.

Rebalancing your data for ML

Data rebalancing involves redistributing data across nodes or partitions in a distributed system to ensure optimal resource utilization and balanced load. As data is added, removed, or updated, or as nodes are added or removed, imbalances can arise. These imbalances may lead to hotspots, where some nodes are heavily used while others are under-utilized, or inefficient data access patterns.

 

Why is Data Rebalancing Important?

  • Performance Optimization: Without rebalancing, some nodes can become overloaded while others remain under-utilized, creating performance bottlenecks.
  • Fault Tolerance: In distributed storage systems like Hadoop’s HDFS or Apache Kafka, data is often replicated across multiple nodes for fault tolerance. Proper rebalancing ensures that data replicas are well-distributed, enhancing the system’s resilience to node failures.
  • Scalability: As a cluster grows or shrinks, rebalancing helps efficiently integrate new nodes or decommission old ones.
  • Storage Efficiency: Ensuring data is evenly distributed maximizes the use of available storage capacity across the cluster.