Ideas Made to Matter

What is synthetic data — and how can it help you competitively?

Brian Eastwood

Jan 23, 2023

Companies committed to data-based decision-making share common concerns about privacy, data integrity, and a lack of sufficient data.

Synthetic data aims to solve those problems by giving software developers and researchers something that resembles real data but isn’t. It can be used to test machine learning models or build and test software applications without compromising real, personal data.

A synthetic data set has the same mathematical properties as the real-world data set it’s standing in for, but it doesn’t contain any of the same information. It’s generated by taking a relational database, creating a generative machine learning model for it, and generating a second set of data.

The result is a data set that contains the general patterns and properties of the original — which can run to billions of records — along with enough “noise” to mask the data itself, said Kalyan Veeramachaneni, principal research scientist with MIT’s Schwarzman College of Computing.

Gartner has estimated that 60% of the data used in artificial intelligence and analytics projects will be synthetically generated by 2024. Synthetic data offers numerous value propositions for enterprises, including its ability to fill gaps in real-world data sets and replace historical data that’s obsolete or otherwise no longer useful.

“You can take a phone number and break it down. When you resynthesize it, you’re generating a completely random number that doesn’t exist,” Veeramachaneni said. “But you can make sure it still has the properties you need, such as exactly 10 digits or even a specific area code.”
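To make that idea concrete, here is a toy sketch in Python; the helper name and formatting are hypothetical and not part of any real tool. It emits a random 10-digit US-style number while pinning the area code.

```python
import random

def synthesize_phone_number(area_code: str = "617") -> str:
    """Generate a random, non-existent number that keeps the desired structure."""
    exchange = random.randint(200, 999)   # plausible 3-digit exchange
    line = random.randint(0, 9999)        # random 4-digit line number
    return f"({area_code}) {exchange:03d}-{line:04d}"

print(synthesize_phone_number())  # e.g. "(617) 483-0921" -- random each run
```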

Synthetic data: “no significant difference” from the real thing

A decade ago, Veeramachaneni and his research team were working with large amounts of student data from an online educational platform. The data was stored on a single machine and had to be encrypted. This was important for security and regulatory reasons, but it slowed things down.

At first, Veeramachaneni’s research team tried to create a fake data set. But because the fake data was randomly generated, it did not have the same statistical properties as the real data.

That’s when the team began developing the Synthetic Data Vault, an open-source software tool for creating and using synthetic data sets. It was built using real data to train a generative machine learning model, which then generated samples that had the same properties as the real data, without containing the specific information.

To begin, researchers created synthetic data sets for five publicly available data sets. They then invited freelance data scientists to develop predictive models on both the synthetic and the real data sets and to compare the results.

In a 2016 paper, Veeramachaneni and co-authors Neha Patki and Roy Wedge, also from MIT, demonstrated that there was “no significant difference” between predictive models generated on synthetic data and real data.

“We were starting to realize that we can do a significant amount of software development with synthetic data,” Veeramachaneni said. Between his work at MIT and his role with PatternEx, an AI cybersecurity startup, “I started getting more and more evidence every day that there was a need for synthetic data,” he said.

Use cases have included offshore software development, medical research, and performance testing, which can require data sets significantly larger than most organizations have on hand.

The Synthetic Data Vault is freely available on GitHub, and the latest of its 40 releases was issued in December 2022. The software, now part of DataCebo, has been downloaded more than a million times, Veeramachaneni said, and is used by financial institutions and insurance companies, among others.

It’s also possible for an organization to build its own synthetic data sets. Generally speaking, it requires an existing data set, a machine learning model, and the expertise needed to train a model and evaluate its output.  
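As a rough illustration of that workflow, the sketch below uses the open-source SDV library; it assumes the SDV 1.x single-table API, and the toy table and column names are made up for the example.

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# A tiny stand-in for a real table (illustrative columns only).
real = pd.DataFrame({
    "age": [34, 45, 29, 52, 41],
    "balance": [1200.0, 560.5, 80.0, 4300.2, 950.9],
    "region": ["NE", "SW", "NE", "MW", "SE"],
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real)            # infer column types from the real table

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real)                           # learn a generative model of the table
synthetic = synthesizer.sample(num_rows=1000)   # draw new rows with similar statistical properties
```

In practice the real table would be far larger, and the synthetic output would be checked against the original before being put to use.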

A step above de-identification

Software developers and data scientists often work with data sets that have been “de-identified,” meaning that personal information, such as a credit card number, birth date, bank account number, or health plan number, has been removed to protect individuals’ privacy. This is required for publicly available data, and it’s a cornerstone of health care and life science research.

But it’s not foolproof. A list of credit card transactions might not display an account number, Veeramachaneni said, but the date, location, and amount might be enough to trace the transaction back to the night you met a friend for dinner. On a broader scale, even health records de-identified against 40 different variables can be re-identified if, for example, someone takes a specific medication to treat a rare disease.

A synthetic data set doesn’t suffer these shortcomings. It preserves the correlations among data variables — the rare disease and the medication — without linking the data to the individual with that diagnosis or prescription. “You can model and sample the properties in the original data without having a problem of data leakage,” Veeramachaneni said.

This means that synthetic data can be shared much more easily than real data. Industry best practices in health care and finance suggest that data should be encrypted at rest, in use, and in transit. Even if this isn’t explicitly required in federal regulations, it’s implied by the steep penalties assessed for the failure to protect personal information in the event of a data breach.

In the past, that’s been enough to stop companies from sharing data with software developers, or even sharing it within an organization. The intention is to keep data in (purportedly) safe hands, but the effect is that it hinders innovation, as data isn’t readily available for building a software prototype or identifying potential growth opportunities.

“There are a lot of issues around data management and access,” Veeramachaneni said. It gets even thornier when development, testing, and debugging teams have been offshored. “You have to increase productivity, but you don’t want to put people in a situation where they have to make judgment calls about whether or not they should use the data set,” he said. 

Synthetic data eliminates the need to move real data sets from one development team to another. It also lets individuals store data locally instead of logging into a central server, so developers can work at the pace they’re used to.

An additional benefit, Veeramachaneni said, is the ability to address bias in data sets as well as the models that analyze them. Since synthetic data sets aren’t limited to the original sample size, it’s possible to create a new data set and refine a machine learning model before using the data for development or analysis.

Access to data means access to opportunities

The ability to freely share and work with synthetic data might be its greatest benefit: It’s broadly available and ready to be used.

For Veeramachaneni, accessing synthetic data is like accessing computing power. He recalled going to the computer lab at night in graduate school about 20 years ago to run data simulations on 30 computers at the same time. Today, students can do this work on their laptops, thanks to the availability of high-speed internet and cloud computing resources.

Data today is treated like the computer lab of yesteryear: Access is restricted — and so are opportunities for college students, professional developers, and data scientists to test new ideas. With far fewer limitations on who can use it, synthetic data can provide these opportunities, Veeramachaneni said.

“If I hadn’t had access to data sets the way I had in the last 10 years, I wouldn’t have a career,” he said. Synthetic data can remove the speed bumps and bottlenecks that are slowing down data work, Veeramachaneni said, and it can enhance both individual careers and overall efficiency.

A 3D dog from a single photograph

Synthetic data can be more than rows in a database — it can also be art. Earlier this year, social media was enamored with DALL-E, the AI and natural language processing system that creates new, realistic images from a written description. Many people appreciated the possibility for whimsical art: NPR put DALL-E to work depicting a dinosaur listening to the radio and legal affairs correspondent Nina Totenberg dunking a basketball in space.

This technology has been years in the making as well. Around the same time that Veeramachaneni was building the Synthetic Data Vault, Ali Jahanian was applying his background in visual arts to AI at MIT’s Computer Science and Artificial Intelligence Laboratory. AI imaging was no stranger to synthetic data. A 3D flight simulator is a prime example, creating a realistic experience of, say, landing an airplane on an aircraft carrier.

These programs require someone to input parameters first. “There’s a lot of time and effort in creating the model to get the right scene, the right lighting, and so on,” said Jahanian, now a research scientist at Amazon. In other words, someone needed to take the time to describe the aircraft carrier, the ocean, the weather, and so on in data points that a computer could understand.

As Veeramachaneni did with his own data set, Jahanian focused on developing AI models that could generate graphical outputs based on observations of and patterns in real-world data, without the need for manual data entry.

The next step was developing an AI model that could transform a static image. Given a single 2D picture of a dog, the model can let you view the dog from different angles, or with a different color fur.

“The photo is one moment in time, but the synthetic data could be different views of the same object,” Jahanian said. “You can exhibit capabilities that you don’t have in real data.”

And you can do it at no cost: Both the Synthetic Data Vault and DALL-E are free. Microsoft (which is backing the DALL-E project financially) has said that users are creating more than 2 million images per day.

 There are concerns about data privacy, ownership, and misinformation. An oil painting of a Tyrannosaurus rex listening to a tabletop radio is one thing; a computer-generated image of protestors on the steps of the U.S. Capitol is another.

Jahanian said these concerns are valid but should be considered in the larger context of what the technology makes possible. One example is medicine: A visualization of a diseased heart would be a lot more impactful than a lengthy clinical note describing it.

“We need to embrace what these models provide to us, rather than being skeptical of them,” Jahanian said. “As people see how they work, they’ll start to influence how they are shaped and trained and used, and we can make them more accessible and more useful for society.”

Title: Synthetic Data -- what, why and how?

Abstract: This explainer document aims to provide an overview of the current state of the rapidly expanding work on synthetic data technologies, with a particular focus on privacy. The article is intended for a non-technical audience, though some formal definitions have been given to provide clarity to specialists. This article is intended to enable the reader to quickly become familiar with the notion of synthetic data, as well as understand some of the subtle intricacies that come with it. We do believe that synthetic data is a very useful tool, and our hope is that this report highlights that, while drawing attention to nuances that can easily be overlooked in its deployment.


8 best practices for synthetic data generation

6 December 2024

Rina Caballar

Cole Stryker

Editorial Lead, AI Models

When you hear the word “synthetic,” you might associate it with something artificial or fabricated. Take synthetic fibers such as polyester and nylon, for example, which are man-made through chemical processes.

While synthetic fibers are more affordable and easier to mass-produce, their quality can rival that of natural fibers. They’re often designed to mimic their natural counterparts and are engineered for specific uses—be it elastic elastane, heat-retaining acrylic or durable polyester.

The same is true for synthetic data. This artificially generated information can supplement or even replace real-world data when training or testing artificial intelligence (AI) models. Compared to real datasets, which can be costly to obtain, difficult to access, time-consuming to label and limited in supply, synthetic datasets can be generated through computer simulations or generative models. This makes them cheaper to produce on demand, available in nearly limitless volumes and customizable to an organization’s needs.

Despite its benefits, synthetic data also comes with challenges. The generation process can be complex, with data scientists having to create realistic data while still maintaining quality and privacy.

Yet synthetic data is here to stay. Research firm Gartner predicts that by 2026, 75% of businesses will use generative AI to create synthetic customer data. 1

To help enterprises get the most out of artificial data, here are 8 best practices for synthetic data generation:

1. Know your purpose

Understand why your business needs synthetic data and the use cases where it might be more helpful than real data. In healthcare, for instance, patient records or medical images can be artificially generated—without containing any sensitive data or personally identifiable information (PII). This also allows safe data sharing between researchers and data science teams.

Synthetic data can be used as test data during software development, standing in for sensitive production data but still emulating its characteristics. It also allows companies to avoid copyright and intellectual property issues, generating data instead of employing web crawlers to scrape and collect information from websites without users’ knowledge or consent.

Also, artificial data can act as a form of data augmentation. It can be used to boost data diversity, especially for underrepresented groups in AI model training. And when information is sparse, synthetic data can fill in the gaps.

Financial services firm J.P. Morgan, for example, found it difficult to effectively train AI-powered models for fraud detection due to the lack of fraudulent cases compared to non-fraudulent ones. The organization used synthetic data generation to create more examples of fraudulent transactions (link resides outside ibm.com), thereby enhancing model training.
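As a simple illustration of adding minority-class examples, here is a sketch using SMOTE from the imbalanced-learn library; this is only a stand-in technique for the example and not necessarily the method J.P. Morgan used.

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy dataset where only ~1% of rows are "fraud".
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)

# Synthesize additional minority-class rows to balance the training set.
X_balanced, y_balanced = SMOTE(random_state=0).fit_resample(X, y)
print(int(y.sum()), "fraud rows before;", int(y_balanced.sum()), "after oversampling")
```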

2. Preparation is key

Synthetic data quality is only as good as the real-world data underpinning it. When preparing original datasets for synthetic data generation by machine learning (ML) algorithms, make sure to check for and correct any errors, inaccuracies and inconsistencies. Remove any duplicates and fill in missing values.

Consider adding edge cases or outliers to the original data. These data points can represent uncommon events, rare scenarios or extreme cases that mirror the unpredictability and variability of the real world.
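A short pandas sketch of that preparation step; the table and column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in for a real table loaded from a warehouse or CSV (columns are illustrative).
df = pd.DataFrame({
    "amount": [120.5, 120.5, np.nan, -3.0, 87.2, 4500.0],
    "channel": ["card", "card", "wire", "card", "card", "wire"],
})

df = df.drop_duplicates()                                   # remove duplicate rows
df["amount"] = df["amount"].fillna(df["amount"].median())   # fill in missing values
df = df[df["amount"] >= 0]                                  # drop obviously erroneous rows

# Append a hand-crafted edge case, e.g. an unusually large transaction.
edge_case = pd.DataFrame({"amount": [250_000.0], "channel": ["wire"]})
df = pd.concat([df, edge_case], ignore_index=True)
print(df)
```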

“It comes down to the seed examples,” says Akash Srivastava, chief architect at InstructLab (link resides outside ibm.com), an open source project from IBM® and Red Hat that employs a collaborative approach to adding new knowledge and skills to a model, which is powered by IBM’s new synthetic data generation method and phased-training protocol. “The examples through which you seed the generation need to mimic your real-world use case.”

3. Diversify data sources

Synthetic data is still prone to inheriting and reflecting the biases that might be present in the original data it’s based on. Blending information from multiple sources, including different demographic groups and regions, can help mitigate bias in the generated data.

Diverse data sources can also elevate the quality of synthetic datasets. Varied sources can offer essential details or vital context that a single source or only a handful of sources lack. Also, incorporating retrieval-augmented generation into the synthetic data generation process can provide access to up-to-date and domain-specific data that can increase accuracy and further improve quality.

4. Choose appropriate synthesis techniques

Selecting the right synthetic data generation technique depends on a few factors, including data type and complexity. Relatively simple data might benefit from statistical methods. More intricate datasets—structured data like tabular data or unstructured data such as images or videos, for example—might require deep learning models. Enterprises might also opt to combine synthesis techniques according to their requirements.

Here are some common mechanisms for synthetic data generation:

Statistical distribution

Data scientists can analyze statistical distributions in real data and generate synthetic samples that mirror those distributions. However, this requires significant knowledge and expertise, and not all data fit into a known distribution.
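A sketch of that approach with SciPy, fitting a log-normal distribution to stand-in “real” values and sampling synthetic ones from it; the distribution choice is illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
real = rng.lognormal(mean=3.0, sigma=0.5, size=5_000)   # stand-in for real, skewed amounts

# Estimate distribution parameters from the real data, then sample synthetic values.
shape, loc, scale = stats.lognorm.fit(real, floc=0)
synthetic = stats.lognorm.rvs(shape, loc=loc, scale=scale, size=5_000, random_state=1)

print(f"real mean={real.mean():.1f}, synthetic mean={synthetic.mean():.1f}")
```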

Generative adversarial networks

Generative adversarial networks (GANs) consist of two neural networks : a generator that creates synthetic data and a discriminator that acts as an adversary, discriminating between artificial and real data. Both networks are trained iteratively, with the discriminator’s feedback improving the generator’s output until the discriminator is no longer able to distinguish artificial from real data.

GANs can be used to generate synthetic images for computer vision and image classification tasks.
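For a sense of the mechanics, here is a deliberately small GAN sketch in PyTorch (assumed installed) that learns a one-dimensional toy distribution; image GANs follow the same adversarial loop with convolutional networks.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim, batch_size = 8, 128

generator = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 1))
discriminator = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

for step in range(2000):
    real = 3.0 + 0.5 * torch.randn(batch_size, 1)          # "real" samples from N(3, 0.5)
    fake = generator(torch.randn(batch_size, latent_dim))  # synthetic samples

    # Discriminator: learn to tell real from fake.
    opt_d.zero_grad()
    d_loss = (loss_fn(discriminator(real), torch.ones(batch_size, 1)) +
              loss_fn(discriminator(fake.detach()), torch.zeros(batch_size, 1)))
    d_loss.backward()
    opt_d.step()

    # Generator: learn to fool the discriminator.
    opt_g.zero_grad()
    g_loss = loss_fn(discriminator(fake), torch.ones(batch_size, 1))
    g_loss.backward()
    opt_g.step()

with torch.no_grad():
    samples = generator(torch.randn(1000, latent_dim))
print(f"synthetic mean={samples.mean():.2f}, std={samples.std():.2f}")  # should approach 3.0 and 0.5
```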

Variational autoencoders

Variational autoencoders (VAEs) are deep learning models that generate variations of the data they’re trained on. An encoder compresses input data into a lower-dimensional space, capturing the meaningful information contained in the input. A decoder then reconstructs new data from this compressed representation. Like GANs, VAEs can be used for image generation.
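A compact VAE sketch in PyTorch (assumed installed) on toy two-dimensional data shows the encode, reparameterize, decode loop; real image VAEs swap in convolutional layers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
data_dim, latent_dim = 2, 2

class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU())
        self.to_mu = nn.Linear(32, latent_dim)
        self.to_logvar = nn.Linear(32, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        return self.decoder(z), mu, logvar

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mixing = torch.tensor([[1.0, 0.8], [0.0, 0.6]])                   # induces correlated toy "real" data

for step in range(2000):
    x = torch.randn(256, data_dim) @ mixing
    recon, mu, logvar = model(x)
    recon_loss = ((recon - x) ** 2).mean()                        # reconstruction term
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp()) # KL regularizer
    loss = recon_loss + 0.1 * kl
    opt.zero_grad()
    loss.backward()
    opt.step()

# New synthetic points come from decoding random latent vectors.
with torch.no_grad():
    synthetic = model.decoder(torch.randn(1000, latent_dim))
```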

Transformer models

Transformer models, such as generative pretrained transformers (GPTs), excel in understanding the structure and patterns in language. They can be used to generate synthetic text data for natural language processing applications or to create artificial tabular data for classification or regression tasks.
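A minimal sketch with the Hugging Face transformers pipeline (assumed installed); “gpt2” is only a convenient stand-in model, and the prompt is illustrative.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
samples = generator(
    "Customer support ticket: ",
    max_new_tokens=40,
    num_return_sequences=3,
    do_sample=True,
)
for sample in samples:
    print(sample["generated_text"])   # three synthetic ticket-like snippets
```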

5. Consider model collapse

It’s important to consider model collapse, wherein a model’s performance declines as it’s repeatedly trained on AI-generated data. That’s why it’s essential to ground the synthetic data generation process in real data.

At InstructLab, for instance, synthetic data generation is driven by a taxonomy, which defines the domain or topics that the original data comes from. This prevents the model itself from deciding what data it is trained on.

“You’re not asking the model to just keep going in a loop and collapse. We completely bypass the collapsing by decoupling the model from the sampling process,” Srivastava says.

6. Employ validation methods

High-quality data is vital to model performance. Verify synthetic data quality by using fidelity- and utility-based metrics. Fidelity refers to how closely synthetic datasets resemble real-world datasets. Utility evaluates how well synthetic data can be used to train deep learning or ML models.

Measuring fidelity involves comparing synthetic data with the original data, often by using statistical methods and visualizations like histograms. This helps determine whether generated datasets preserve the statistical properties of real datasets, such as distribution, mean, median, range and variance, among others.

Assessing correlational similarity through correlation and contingency coefficients, for example, is also essential to help ensure dependencies and relationships between data points are maintained and accurately represent real-world patterns. Neural networks, generative models and language models are typically skilled at capturing relationships in tabular data and time-series data.
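A sketch of such fidelity checks with NumPy and SciPy, comparing per-column distributions (via the Kolmogorov-Smirnov statistic) and correlation matrices; the function name and output fields are illustrative.

```python
import numpy as np
from scipy import stats

def fidelity_report(real: np.ndarray, synthetic: np.ndarray) -> dict:
    """Both inputs are (n_rows, n_columns) numeric arrays with matching columns."""
    ks = [stats.ks_2samp(real[:, j], synthetic[:, j]).statistic for j in range(real.shape[1])]
    corr_gap = np.abs(np.corrcoef(real, rowvar=False) - np.corrcoef(synthetic, rowvar=False)).max()
    return {
        "max_mean_gap": float(np.abs(real.mean(axis=0) - synthetic.mean(axis=0)).max()),
        "max_ks_statistic": float(max(ks)),        # 0 = identical marginals, 1 = disjoint
        "max_correlation_gap": float(corr_gap),    # how far pairwise relationships drift
    }
```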

Measuring utility entails using synthetic data as training data for machine learning models, then comparing model performance against training with real data. Here are some common metrics for benchmarking:

Accuracy calculates the percentage of correct predictions, while precision measures the share of positive predictions that are actually correct.

Recall quantifies the share of actual positive cases that the model correctly identifies.

The F1 score combines precision and recall into a single metric.

Both the inception score and Fréchet inception distance (FID) evaluate the quality of generated images.

Synthetic data generation tools or providers might already have these metrics on hand, but you can also use other analytics packages like SDMetrics (link resides outside ibm.com), an open source Python library for assessing tabular synthetic data.
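One common utility check is “train on synthetic, test on real.” The sketch below, using scikit-learn, compares the F1 score of a model trained on synthetic rows against one trained on real rows; the function name and model choice are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def utility_gap(X_real_train, y_real_train, X_synth, y_synth, X_real_test, y_real_test):
    """Smaller gap means the synthetic data trains models almost as well as real data."""
    real_model = RandomForestClassifier(random_state=0).fit(X_real_train, y_real_train)
    synth_model = RandomForestClassifier(random_state=0).fit(X_synth, y_synth)
    f1_real = f1_score(y_real_test, real_model.predict(X_real_test))
    f1_synth = f1_score(y_real_test, synth_model.predict(X_real_test))
    return f1_real - f1_synth
```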

The human touch is still crucial when validating artificial data, and it can be as simple as taking 5 to 10 random samples from the synthetic dataset and appraising them yourself. “You have to have a human in the loop for verification,” says Srivastava. “These are very complicated systems, and just like in any complicated system, there are many delicate points at which things might go wrong. Rely on metrics, rely on benchmarks, rigorously test your pipeline, but always take a few random samples and manually check that they are giving you the kind of data you want.”

7. Keep data privacy top of mind

One of the advantages of using synthetic data is that it doesn’t contain any sensitive data or PII. However, enterprises must still verify that the new data they generate complies with privacy regulations, such as the European Union’s General Data Protection Regulation (GDPR) or the US Health Insurance Portability and Accountability Act (HIPAA).

Treat synthetic data like proprietary data, applying built-in security measures and access controls to prevent data hacks and leaks. Safeguards must also be applied during the generation process to prevent the risk of synthetic data being reverse engineered and traced back to its real-world equivalent, revealing sensitive information during data analysis. These safeguards include techniques like masking to hide sensitive fields, anonymization to scrub or remove PII, and differential privacy to add “noise” or introduce randomness to the dataset.
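Two of those safeguards sketched in Python; the field names and parameters are illustrative, and production differential privacy requires careful budget accounting beyond this.

```python
import numpy as np

def mask_account_number(account: str) -> str:
    """Masking: keep only the last four characters visible."""
    return "*" * (len(account) - 4) + account[-4:]

def laplace_noise(value: float, sensitivity: float, epsilon: float) -> float:
    """Differential-privacy-style noise: Laplace noise scaled by sensitivity/epsilon."""
    rng = np.random.default_rng()
    return value + float(rng.laplace(loc=0.0, scale=sensitivity / epsilon))

print(mask_account_number("4111111111111111"))                    # ************1111
print(laplace_noise(52_000.0, sensitivity=1_000.0, epsilon=0.5))  # noisy aggregate value
```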

“At the minimum, PII masking or scrubbing is required, or you could go a step further and use differential privacy methods,” Srivastava says. “It becomes even more important if you are not using local models. If you’re sending [data] to some third-party provider, it is even more important that you’re extra careful about these aspects.”

Note that synthetic data can’t usually be optimized simultaneously for fidelity, utility and privacy—there will often be a tradeoff. Masking or anonymization might nominally reduce utility, while differential privacy might slightly decrease accuracy. However, not implementing any privacy measures can potentially expose PII. Organizations must balance and prioritize what is crucial for their specific use cases.

8. Document, monitor and refine

Keep a record of your synthetic data generation workflow, such as strategies for cleaning and preparing original datasets, mechanisms for generating data and maintaining privacy, and verification results. Include the rationale behind your choices and decisions for accountability and transparency.

Documentation is especially valuable when conducting periodic reviews of your synthetic data generation process. These records serve as audit trails that can help with evaluating the effectiveness and reproducibility of the workflow.

Routinely monitor how synthetic data is used and how it performs to identify any unexpected behaviors that might crop up or opportunities for improvement. Adjust and refine the generation process as needed.

Much like fibers are the foundation of fabrics, data is the building block of AI models. And while synthetic data generation is still in its early stages, advancements in the generation process can help synthetic data reach a point where it matches the quality, reliability and utility of real data, akin to the way synthetic fibers almost equal natural fibers.


1. 3 Bold and Actionable Predictions for the Future of GenAI (link resides outside ibm.com), Gartner, 12 April 2024
