Making Data Useful

The pros and cons of synthetic data

Should you be getting on the synthetic data bandwagon?

Cassie Kozyrkov

For my introductory articles on synthetic data, here’s a quick index to the series, broken up into bite-sized pieces:

Once you’re comfy with the basics, we can jump straight into the pros and cons of synthetic data.

Synthetic image by the author.

Biggest pros of synthetic data

  • Synthetic data can be cheaper or easier to obtain than real-world data (if it isn’t, you probably don’t want it).
  • If getting real-world data is unfeasible, synthetic data gives you some hope that you might be able to build your desired automation solution anyway.
  • You have more control over the design of your synthetic dataset than your real-world dataset (this can be a con if you’re not careful).
  • Synthetic data can be great for debugging (see earlier article), especially for stress-testing your system’s ability to handle outliers and weird things.
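
To make the stress-testing point concrete, here is a minimal sketch in Python. Everything in it is hypothetical and made up for illustration: `robust_mean` stands in for whatever system you are debugging, and `make_stress_dataset` is one simple way to mix ordinary-looking values with deliberately weird ones.

```python
import math
import random


def robust_mean(values):
    """Hypothetical system under test: the mean of the finite numbers, or None if there are none."""
    finite = [
        v for v in values
        if isinstance(v, (int, float)) and math.isfinite(v)
    ]
    return sum(finite) / len(finite) if finite else None


def make_stress_dataset(n_normal=100, seed=0):
    """Build a synthetic dataset: ordinary values plus hand-picked oddities."""
    rng = random.Random(seed)
    # Ordinary-looking values (the mean and spread are arbitrary assumptions).
    normal = [rng.gauss(50.0, 5.0) for _ in range(n_normal)]
    # The weird things we want the system to survive.
    weird = [float("nan"), float("inf"), float("-inf"), None, "", 1e9, -1e9]
    data = normal + weird
    rng.shuffle(data)
    return data


data = make_stress_dataset()
result = robust_mean(data)

# The system should neither crash nor return NaN/inf on the stress set.
print("finite result:", result is not None and math.isfinite(result))
```

Because you design the oddities yourself, you know exactly which ones went in, which makes it easier to trace a failure back to the input that caused it.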

Biggest pros of AI-generated synthetic data

All of the above, plus:

  • AI-generated synthetic data can be hard for humans to distinguish from the real thing.
  • If you get it from a good source, you’re taking advantage of a summary of millions of datapoints that you won’t need to collect/buy yourself.

Biggest cons of synthetic data

  • It’s synthetic! Real data is always better if you’re trying to represent the real world, but sometimes it’s hard, expensive, or impossible to get. Even so, don’t expect synthetic data to represent reality.
  • It’s unnecessary when real data is cheap and plentiful.
  • It’s as simple-minded as we are. When we create a pithy recipe for making new datapoints, especially complex datapoints, we can’t trust ourselves to have represented all of reality’s relevant characteristics.
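
The last point can be illustrated with a small Python sketch. All numbers and names here are invented for illustration, and the "reality" is itself simulated, so treat it as a toy rather than evidence. The pithy recipe copies each column's mean and spread but samples the columns independently, so it quietly drops the relationship between them.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient, computed by hand to keep the sketch stdlib-only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)
n = 2000

# Stand-in for reality (hypothetical): weight depends on height, plus noise.
heights = [rng.gauss(170, 10) for _ in range(n)]
weights = [0.9 * h - 85 + rng.gauss(0, 5) for h in heights]

# Pithy recipe: roughly match each column's mean and spread, sampled independently.
synth_heights = [rng.gauss(170, 10) for _ in range(n)]
synth_weights = [rng.gauss(68, 10.3) for _ in range(n)]

real_r = pearson(heights, weights)
synth_r = pearson(synth_heights, synth_weights)

print(f"correlation in the stand-in for reality: {real_r:.2f}")
print(f"correlation in the recipe's output:      {synth_r:.2f}")
```

Each synthetic column looks plausible on its own, yet anything trained on the pair would never see that taller people tend to weigh more. The recipe only contains what its author thought to put in.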
