
Why would you want synthetic data?

Is synthetic data the hot new thing or an act of desperation?

Cassie Kozyrkov
3 min read · Jul 21, 2023

First off, what is synthetic data and why should you care?

Synthetic data is, to put it bluntly, fake data. If you’re new to the topic, I’d recommend starting with my previous blog post about it. (Or continue here, YOLO.)

Synthetic data can be a collection of datapoints to fill a table or database, but these days plenty of more complicated and interesting objects are being simulated too. Much of the current hype around synthetic data involves AI-generated images and text.

An example of an AI-generated image made with Midjourney.

Let’s put this in context to make the pros and cons of jumping on this bandwagon a little clearer: why might you want AI-generated synthetic data?

First off, the answer isn’t that you want to make some art or poetry. If you’re using generative AI as a raw material for making something delightful, we would call that creative expression (yours, not the machine’s), not synthetic data. The term synthetic data implies that each thing you’ve synthesized is a datapoint for a dataset, most likely to be used for something like:

Why would you want synthetic data in your dataset?

If you’re creating a dataset for debugging/testing some software, you might be making some weird examples to see if your system chokes on them. If you want data that’s unlike what’s found in nature, you make it yourself. Like candy. (Learn more here.)
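To make that concrete, here's a minimal sketch in Python of what "making some weird examples" could look like. Everything in it is a made-up stand-in for illustration: `parse_age` is a hypothetical function under test, and the list of edge cases is just one plausible starting point, not a recipe from any real system.

```python
import random
import string


def make_weird_strings(n, seed=0):
    """Return n synthetic strings: hand-picked edge cases first, then random noise."""
    rng = random.Random(seed)  # fixed seed so the fake dataset is reproducible
    edge_cases = [
        "",                # empty input
        " ",               # whitespace only
        "\n",              # lone newline
        "-1",              # negative number
        "a" * 10_000,      # absurdly long input
        "名前",             # non-ASCII text
        "Robert'); DROP TABLE students;--",  # injection-style text
    ]
    out = edge_cases[:n]
    while len(out) < n:
        length = rng.randint(1, 50)
        out.append("".join(rng.choice(string.printable) for _ in range(length)))
    return out


def parse_age(text):
    """Hypothetical function under test: return a plausible human age, else None."""
    try:
        age = int(text.strip())
    except ValueError:
        return None
    return age if 0 <= age <= 150 else None


def find_crashes(func, inputs):
    """Call func on every input and collect the ones that raise an exception."""
    crashes = []
    for x in inputs:
        try:
            func(x)
        except Exception as exc:  # deliberately broad: any crash is interesting here
            crashes.append((x, exc))
    return crashes


weird_inputs = make_weird_strings(100)
crashes = find_crashes(parse_age, weird_inputs)
print(f"{len(crashes)} of {len(weird_inputs)} synthetic inputs crashed parse_age")
```

The point of the sketch is that none of these strings needs to resemble anything a real user would type. You're manufacturing the unnatural cases on purpose so the system meets them in testing rather than in production.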

If you’re creating research algorithms for other people to apply as tools for solving their specific problems, you’d most likely use synthetic data in your initial development. Why? You’d want to be able to compare the performance of your invention against the next best thing in a dataset with all the gnarly real-world issues ironed out. Kind of like…

Written by Cassie Kozyrkov

Chief Decision Scientist, Google. ❤️ Stats, ML/AI, data, puns, art, theatre, decision science. All views are my own. decision.substack.com
