Why would you want synthetic data?
Is synthetic data the hot new thing or an act of desperation?
First off, what is synthetic data and why should you care?
Synthetic data is, to put it bluntly, fake data. If you’re new to the topic, I’d recommend starting with my previous blog post about it. (Or continue here, YOLO.)
Synthetic data can be a collection of datapoints to fill a table or database, but these days plenty of more complicated and interesting objects are being simulated. Much of the current hype around synthetic data involves AI-generated images and text.
Let’s put this in context to make the pros and cons of jumping on this bandwagon a little clearer: why might you want AI-generated synthetic data?
First off, the answer isn’t that you want to make some art or poetry. If you’re using generative AI as a raw material for making something delightful, we would call that creative expression (yours, not the machine’s), not synthetic data. The term synthetic data implies that each thing you’ve synthesized is a datapoint for a dataset, most likely to be used for something like:
- Training an AI system
- Debugging an AI system
- Planning your statistical procedure
- Testing your code/model before going live in production
- Creating research benchmarks and comparing theoretical performance
- Fighting bias
- Getting sued (not recommended!)
Why would you want synthetic data in your dataset?
If you’re creating a dataset for debugging/testing some software, you might be making some weird examples to see if your system chokes on them. If you want data that’s unlike what’s found in nature, you make it yourself. Like candy. (Learn more here.)
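To make the "weird examples" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: `safe_parse_age` stands in for whatever system you are actually testing, and `make_weird_examples` is just one way to cook up inputs you would rarely find in nature.

```python
import math
import random


def make_weird_examples(n_random=5, seed=0):
    """Build a small synthetic test set of deliberately awkward inputs.

    Hypothetical helper: the specific cases are illustrative, not exhaustive.
    """
    rng = random.Random(seed)  # seeded so the "random" junk is reproducible
    handmade = [
        "",          # empty string
        "   ",       # whitespace only
        "-1",        # negative number
        "1e309",     # too large for a float, parses to infinity
        "NaN",       # parses to not-a-number
        "42 years",  # number with trailing text
    ]
    alphabet = "0123456789.-e "
    random_junk = [
        "".join(rng.choice(alphabet) for _ in range(rng.randint(1, 8)))
        for _ in range(n_random)
    ]
    return handmade + random_junk


def safe_parse_age(text):
    """Hypothetical system under test: return an int age in [0, 150], else None."""
    try:
        value = float(text)
    except ValueError:
        return None
    if not math.isfinite(value) or not 0 <= value <= 150:
        return None
    return int(value)


# Feed every synthetic example to the system and see whether it chokes.
examples = make_weird_examples()
results = {example: safe_parse_age(example) for example in examples}
for example, parsed in results.items():
    print(repr(example), "->", parsed)
```

The handmade cases cover the failures you already suspect, while the seeded random strings are a cheap way to stumble on ones you didn't think of. Swap in your own function and your own flavor of weird.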
If you’re creating research algorithms for other people to apply as tools for solving their specific problems, you’d most likely use synthetic data in your initial development. Why? You’d want to be able to compare the performance of your invention against the next best thing on a dataset with all the gnarly real-world issues ironed out. Kind of like…