Synthetic Data
What is the difference between synthetic data (a synthetic data twin) and mock data?

Mock data and AI-generated synthetic data are both types of synthetic data, but they are generated in different ways and serve different purposes.

Mock data is a type of synthetic data that is created manually, typically for testing and development purposes. It simulates the behavior of real-world data in a controlled environment, for example to test the functionality of a system or application. It is usually simple, easy to generate, and does not require complex models or algorithms. Mock data is also often referred to as “dummy data” or “fake data”.

AI-generated synthetic data, on the other hand, is generated using artificial intelligence techniques, such as machine learning or generative models. It is used to create realistic and representative data that can be used in place of real-world data when using the real-world data would be impractical or unethical due to strict privacy regulations. It is often more complex and requires more computational resources than manually created mock data. As a result, it is much more realistic and mimics the original data as closely as possible.

In summary, mock data is manually created and is typically used for testing and development, while AI-generated synthetic data is created using artificial intelligence techniques and is used to create representative and realistic data.

Do you support mockers and mock data?

Yes, we do. We offer various synthetic data optimization and augmentation features, including mockers, to take your data to the next level.

What do you mean by generating a ‘synthetic data twin’?

A synthetic data twin is an algorithm-generated replica of a real-world dataset and/or database. With a synthetic data twin, Syntho aims to mimic the original dataset or database as closely as possible to create a realistic representation of it. With a synthetic data twin, we aim for superior synthetic data quality when compared with the original data. We do this with our synthetic data software, which uses state-of-the-art AI models. These AI models generate completely new data points and model them in such a way that the characteristics, relationships and statistical patterns of the original data are preserved to such an extent that you can use the synthetic data as if it were the original data.

A synthetic data twin can be used for a variety of purposes, such as testing and training machine learning models, simulating scenarios for research and development, and creating virtual environments for training and education. It provides realistic and representative data that can be used in place of real-world data when that data is not available, or when using it would be impractical or unethical due to strict data privacy regulations.

What are typical synthetic data use cases?

Generally, most of our clients use synthetic data for:

  • Software testing & development
  • Analytics, model development and advanced analytics (AI & ML)
  • Product demos
Data Quality
Do you preserve referential integrity over multi-table databases?

Yes, we do. Our platform is optimized for databases and, consequently, for preserving referential integrity between the tables in the database.

Curious to find out more about this?

Ask our experts directly.

Is the quality of AI-generated synthetic data good enough for advanced analytics (e.g. AI, ML, BI)?

Yes, it is. The synthetic data can even hold patterns that you did not know were present in the original data.

But don’t just take our word for it. The analytics experts of SAS (global market leader in analytics) did an (AI) assessment of our synthetic data and compared it with the original data. Curious? Watch the whole event here or watch the short version about data quality here.

How does Syntho demonstrate the quality of generated synthetic data?

Guaranteeing that synthetic data holds the same data quality as the original data can be challenging, and often depends on the specific use case and the methods used to generate the synthetic data. Some methods for generating synthetic data, such as generative models, can produce data that is highly similar to the original data. Key question: how to demonstrate this?

There are some ways to ensure the quality of synthetic data:

  • Data quality metrics via our data quality report: One way to ensure that synthetic data holds the same data quality as the original data is to use data quality metrics to compare the synthetic data to the original data. These metrics can be used to measure things like similarity, accuracy, and completeness of the data. Syntho software includes a data quality report with various data quality metrics.
  • External evaluation: since the data quality of synthetic data in comparison to original data is key, we recently did an assessment with the data experts of SAS (market leader in analytics) to demonstrate the data quality of synthetic data by Syntho in comparison to the real data. Edwin van Unen, analytics expert from SAS, evaluated generated synthetic datasets from Syntho via various analytics (AI) assessments and shared the outcomes. Watch a short recap of that video here.
  • Testing and evaluation by yourself: synthetic data can be tested and evaluated by comparing it to real-world data or by using it to train machine learning models and comparing their performance to models trained on real-world data. Why not test the data quality of synthetic data by yourself? Ask our experts for the possibilities of this here.

It’s important to note that synthetic data can never be guaranteed to be 100% similar to the original data, but it can be close enough to be useful for a specific use case. This specific use case can even be advanced analytics or training machine learning models.
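As a starting point for testing the data yourself, the sketch below compares two simple summary statistics of one numeric column in the original and the synthetic data. It is only an illustration under stated assumptions: the function name compare_column and the sample values are hypothetical, and this is not the set of metrics used in the Syntho data quality report.

```python
from statistics import mean, stdev


def compare_column(original, synthetic):
    """Return the absolute difference in mean and standard deviation
    between an original and a synthetic numeric column."""
    return {
        "mean_diff": abs(mean(original) - mean(synthetic)),
        "stdev_diff": abs(stdev(original) - stdev(synthetic)),
    }


# Hypothetical sample values for an "age" column.
original_ages = [23, 35, 41, 52, 29, 60, 47, 38]
synthetic_ages = [25, 33, 44, 50, 30, 58, 45, 41]

report = compare_column(original_ages, synthetic_ages)
print(report)
```

Small differences suggest that the column's overall distribution was preserved. A fuller evaluation would also look at relationships between columns and at the performance of models trained on each dataset.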

Privacy
What does the Dutch Data Protection Authority say about using synthetic data?

One of the use cases that is specifically highlighted by the Dutch Data Protection Authority is using synthetic data as test data.

More can be found in this article.

What privacy metrics are in the Syntho QA report?

Syntho’s QA report contains three industry-standard metrics for evaluating data privacy. The idea behind each of these metrics is as follows:

  • Synthetic data (S) shall be “as close as possible”, but “not too close” to the target data (T).
  • Randomly selected holdout data (H) determines the benchmark for “too close”.
  • A perfect solution generates new synthetic data that behaves exactly like the original data, but hasn’t been seen before (= H).
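One common way to put the S, T and H idea into practice is a distance-to-closest-record check, sketched below. For every synthetic record it finds the nearest record in T and in H, then reports the share of synthetic records that lie closer to T. If T and H are of similar size, a share around 50% suggests the synthetic data is no closer to the data it was trained on than to unseen data. This is an assumption-based illustration, not necessarily how the metrics in the Syntho QA report are computed; all function names and sample records are hypothetical.

```python
import math


def nearest_distance(record, dataset):
    """Euclidean distance from one record to its closest record in dataset."""
    return min(math.dist(record, other) for other in dataset)


def share_closer_to_training(synthetic, training, holdout):
    """Fraction of synthetic records that lie closer to T than to H."""
    closer = sum(
        1
        for s in synthetic
        if nearest_distance(s, training) < nearest_distance(s, holdout)
    )
    return closer / len(synthetic)


# Hypothetical numeric records (two columns each).
training = [(0.0, 0.0), (10.0, 10.0)]   # T: data used for training
holdout = [(0.0, 1.0), (10.0, 9.0)]     # H: randomly selected, never used
synthetic = [(0.0, 0.4), (10.0, 9.6), (0.0, 0.6), (10.0, 9.4)]  # S

share = share_closer_to_training(synthetic, training, holdout)
print(f"Share of synthetic records closer to T than to H: {share:.0%}")
```

A share well above 50% would indicate that synthetic records sit “too close” to the training data.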
How do you demonstrate privacy?

We demonstrate this via our QA report.

When synthesizing a dataset, it is essential to demonstrate that one is not able to re-identify individuals. In this video, Marijn introduces privacy measures that are in our quality report to demonstrate this.

Does Syntho see and / or process my data?

No. The Syntho Engine is a self-service platform. As a result, you can generate synthetic data with the Syntho Engine in such a way that, throughout the end-to-end process, Syntho never sees your data and is never required to process it.

Do I need to share my data with Syntho to generate synthetic data?

No. We optimized our platform in such a way that it can be easily deployed in the trusted environment of the customer. This ensures that data will never leave the trusted environment of the customer. Deployment options for the trusted environment of the customer are “on-premise” and in the “cloud environment of the customer (private cloud)”.

Does Syntho need access to my data to create synthetic data?

No, we don’t. We can easily deploy the Syntho Engine on-premise or in your private cloud via Docker.

Syntho Engine
Will the referential integrity be preserved when I have a database?

Yes. Syntho software is optimized for databases containing multiple tables.

To this end, Syntho automatically detects the data types, schemas and formats to maximize data accuracy. For multi-table databases, we support automatic table relationship inference and synthesis to preserve referential integrity.
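Referential integrity means that every foreign key value in a child table points to an existing row in the parent table. The sketch below shows how you might verify this yourself on a pair of synthesized tables. It assumes each table is a list of Python dictionaries; the function orphaned_keys and the table and column names are hypothetical and are not part of the Syntho Engine.

```python
def orphaned_keys(child_rows, parent_rows, foreign_key, primary_key):
    """Return the foreign key values in child_rows that have no matching
    primary key in parent_rows. An empty list means integrity holds."""
    parent_ids = {row[primary_key] for row in parent_rows}
    child_refs = {row[foreign_key] for row in child_rows}
    return sorted(child_refs - parent_ids)


# Hypothetical synthesized tables.
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
orders = [
    {"order_id": 100, "customer_id": 1},
    {"order_id": 101, "customer_id": 3},
    {"order_id": 102, "customer_id": 3},
]

missing = orphaned_keys(orders, customers, "customer_id", "customer_id")
print("Orphaned customer_id values:", missing)
```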

Do I need a GPU to use Syntho?

No, we optimized our platform to minimize computational requirements (e.g. no GPU required), without compromising on the data accuracy. In addition, we support auto scaling, so that one can synthesize huge databases.

Which data types do you support?

The Syntho Engine works best on structured, tabular data (anything that contains rows and columns). Within these structures, we support the following data types:

  • Structured data formatted in tables (categorical, numerical, etc.)
  • Direct identifiers and PII
  • Large datasets and databases
  • Geographic location data (like GPS)
  • Time series data
  • Multi-table databases (with referential integrity)
  • Open text data

 

Complex data support
In addition to all regular types of tabular data, the Syntho Engine supports complex data types and complex data structures:

  • Time series
  • Multi-table databases
  • Open text

Are specific skills required to use the Syntho Engine?

Not at all. Although it may take some effort to fully understand the advantages, workings and use cases of synthetic data, the process of synthesizing is very simple and anyone with basic computer knowledge can do it. For more information about the synthesizing process, check out this page or request a demo.

How many training records do I need to synthesize my data?

Syntho’s machine learning algorithms generalize the features better when more entity records are available, which also decreases the privacy risk. A minimum column-to-row ratio of 1:500 is recommended. For example, if your source table has 6 columns, it should contain a minimum of 3000 rows.
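The recommended 1:500 ratio can be turned into a quick check, as in the sketch below. The function names are hypothetical and only illustrate the arithmetic from the recommendation above.

```python
ROWS_PER_COLUMN = 500  # recommended minimum column-to-row ratio of 1:500


def minimum_rows(column_count):
    """Recommended minimum number of rows for a table with column_count columns."""
    return column_count * ROWS_PER_COLUMN


def meets_recommendation(column_count, row_count):
    """True if the table has at least the recommended number of rows."""
    return row_count >= minimum_rows(column_count)


print(minimum_rows(6))                # 6 columns * 500 = 3000 rows
print(meets_recommendation(6, 2500))  # fewer than 3000 rows
```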

How long does it take to generate synthetic data?

Naturally, the generation time depends on the size of the database. On average, a table with fewer than 1 million records is synthesized in less than 5 minutes.

How do you connect the Syntho Engine with your data?

Syntho enables you to easily connect with your databases, applications, data pipelines or file systems.

We support various integrated connectors so that you can connect with the source-environment (where the original data is stored) and the destination environment (where you want to write your synthetic data to) for an end-to-end integrated approach.

Connection features that we support:

  • Plug-and-play with Docker
  • 20+ database connectors
  • 20+ filesystem connectors
Which deployment options do you support?

The Syntho Engine is shipped in a Docker container and can be easily deployed and plugged into your environment of choice.

Possible deployment options include:

  • On-premise
  • Any (private) cloud
  • Any other environment
