Making Data Useful
The pros and cons of synthetic data
Should you be getting on the synthetic data bandwagon?
5 min read · Jul 28, 2023
For my introductory articles on synthetic data, here’s a quick index to the series, broken up into bite-sized pieces:
- What is synthetic data?
- The synthetic data field guide
- Why would you want synthetic data?
- AI-generated synthetic data
Once you’re comfy with the basics, we can jump straight into the pros and cons of synthetic data.
Biggest pros of synthetic data
- Synthetic data can be cheaper or easier to obtain than real-world data (if it isn’t, you probably don’t want it).
- If getting real-world data is unfeasible, synthetic data gives you some hope that you might be able to build your desired automation solution anyway.
- You have more control over the design of your synthetic dataset than your real-world dataset (this can be a con if you’re not careful).
- Synthetic data can be great for debugging (see earlier article), especially for stress-testing your system’s ability to handle outliers and weird things.
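To make the debugging point concrete, here is a minimal sketch of stress-testing with synthetic data. Everything in it is made up for illustration: mean_of_readings is a hypothetical stand-in for whatever system you want to test, and the generator's numbers (a typical reading around 20, a 10% outlier rate, the particular list of weird values) are arbitrary assumptions, not recommendations.

```python
import math
import random


def mean_of_readings(readings):
    """Hypothetical system under test: average the usable readings."""
    usable = [r for r in readings if r is not None and math.isfinite(r)]
    if not usable:
        return None  # nothing usable to average
    return sum(usable) / len(usable)


def make_synthetic_readings(n, outlier_rate=0.1, seed=0):
    """Make n fake readings, deliberately salted with outliers and weird values."""
    rng = random.Random(seed)  # seeded so a failing test can be reproduced
    weird_values = [None, float("nan"), float("inf"), -1e12, 1e12, 0.0]
    readings = []
    for _ in range(n):
        if rng.random() < outlier_rate:
            readings.append(rng.choice(weird_values))  # inject something nasty
        else:
            readings.append(rng.gauss(20.0, 2.0))  # an ordinary-looking reading
    return readings


readings = make_synthetic_readings(1000)
result = mean_of_readings(readings)
print("Survived the weird stuff, mean =", result)
```

Notice that the sketch only checks that the system survives the weird inputs. Whether the extreme values it lets through (like 1e12) should be allowed to drag the average around is a design question that this kind of test can surface but not answer for you.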
Biggest pros of AI-generated synthetic data
All of the above, plus:
- AI-generated synthetic data can be hard for humans to distinguish from the real thing.
- If you get it from a good source, you’re taking advantage of a summary of millions of datapoints that you won’t need to collect/buy yourself.
Biggest cons of synthetic data
- It’s synthetic! Real is always better if you’re trying to represent the real world, but sometimes it’s hard/expensive/impossible to get. Still, don’t expect synthetic data to represent reality.
- It’s unnecessary when real data is cheap and plentiful.
- It’s as simple-minded as we are. When we create a pithy recipe for making new datapoints, especially complex datapoints, we can’t trust ourselves to have represented all of reality’s relevant characteristics.
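Here is a small sketch of how a pithy recipe can quietly drop something that matters. Note the loud assumption: the "real" data below is itself simulated (a made-up relationship between height and weight) because no real dataset comes with this article. The recipe copies each column's mean and spread but draws the columns independently, so the relationship between them disappears.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient, computed by hand."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def mean_and_sd(xs):
    """Mean and (population) standard deviation of a list of numbers."""
    mean = sum(xs) / len(xs)
    sd = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return mean, sd


rng = random.Random(42)

# Stand-in for reality (simulated, illustration only): weight depends on height.
heights = [rng.gauss(170, 10) for _ in range(5000)]
weights = [0.9 * h - 85 + rng.gauss(0, 8) for h in heights]

# The pithy recipe: match each column's mean and spread, draw independently.
height_mean, height_sd = mean_and_sd(heights)
weight_mean, weight_sd = mean_and_sd(weights)
fake_heights = [rng.gauss(height_mean, height_sd) for _ in range(5000)]
fake_weights = [rng.gauss(weight_mean, weight_sd) for _ in range(5000)]

real_corr = pearson(heights, weights)
fake_corr = pearson(fake_heights, fake_weights)
print("correlation in the stand-in for reality:", round(real_corr, 2))
print("correlation in the synthetic data:", round(fake_corr, 2))
```

Each synthetic column looks plausible on its own, yet anything that depends on height and weight moving together would be learning from a world that doesn't exist.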