Why would you want synthetic data?

Is synthetic data the hot new thing or an act of desperation?

Cassie Kozyrkov
3 min read · Jul 21, 2023

First off, what is synthetic data and why should you care?

Synthetic data is, to put it bluntly, fake data. If you’re new to the topic, I’d recommend starting with my previous blog post about it. (Or continue here, YOLO.)

Synthetic data can be a collection of datapoints to fill a table or database, but these days far more complicated and interesting objects are being simulated. Much of the current hype around synthetic data involves AI-generated images and text.

An example of an AI-generated image made with Midjourney.

Let’s put this in context to make the pros and cons of jumping on this bandwagon a little clearer: why might you want AI-generated synthetic data?

First off, the answer isn’t that you want to make some art or poetry. If you’re using generative AI as a raw material for making something delightful, we would call that creative expression (yours, not the machine’s), not synthetic data. The term synthetic data implies that each thing you’ve synthesized is a datapoint for a dataset, most likely to be used for something like:

Cassie Kozyrkov

Chief Decision Scientist, Google. ❤️ Stats, ML/AI, data, puns, art, theatre, decision science. All views are my own. twitter.com/quaesita