Making Data Useful

How Good Data Goes Bad

The data quality crisis no one is talking about

Cassie Kozyrkov
3 min read · Sep 26, 2023

A rule of thumb to save you tears in the long run is to assume every dataset is more like a hoarder’s storage locker than a well-curated museum until proven otherwise.

When in doubt, assume your data’s a junkyard.

But even if you’re not dealing with a dataset that’s a hoardsplosion of we-may-as-wells, there are two ways that fit-for-purpose data turns into garbage:

  1. Information loss during conversion
  2. Information selection issues

There are plenty more ways that bad data can happen to good intentions, but let’s talk about these two major ones for now.

Information loss during conversion

Data quality erodes whenever there’s a problem with the physical conversion of reality into electronic records. This is a relatively simple issue that manifests in numerous ways, from janky hard disks and broken equipment to real-world plans going awry: Were your sensors calibrated? Did your laptop run out of juice? Did the people you paid to enter survey information actually write down what they were supposed to? Did you wait too long while you relied on human…
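To make a few of those questions concrete, here is a minimal sketch of a conversion audit. It is an illustration under stated assumptions, not a method from this article: the function name `audit_readings`, the `(timestamp, value)` record layout, and the temperature bounds are all hypothetical. It flags three symptoms: silent gaps (the recorder may have lost power), missing values (nobody wrote anything down), and values outside the range a calibrated sensor should produce.

```python
from datetime import datetime, timedelta


def audit_readings(readings, expected_step, valid_range):
    """Flag likely conversion problems in time-sorted (timestamp, value) records.

    readings: list of (datetime, float or None), sorted by timestamp
    expected_step: timedelta expected between consecutive readings
    valid_range: (low, high) bounds assumed plausible for a calibrated sensor
    """
    low, high = valid_range
    issues = []
    previous_time = None
    for timestamp, value in readings:
        # A pause longer than the expected step suggests the recorder stopped.
        if previous_time is not None and timestamp - previous_time > expected_step:
            issues.append(("gap", previous_time, timestamp))
        if value is None:
            # Nothing was recorded for this time slot.
            issues.append(("missing", timestamp, None))
        elif not low <= value <= high:
            # Recorded, but implausible: possible calibration or entry error.
            issues.append(("out_of_range", timestamp, value))
        previous_time = timestamp
    return issues


# Hypothetical sample: a room thermometer logged once per minute.
start = datetime(2023, 9, 26, 9, 0)
minute = timedelta(minutes=1)
sample = [
    (start, 21.4),
    (start + minute, None),        # nothing was written down
    (start + 2 * minute, 250.0),   # implausible for a room thermometer
    (start + 10 * minute, 21.6),   # eight minutes of silence before this
]

for issue in audit_readings(sample, expected_step=minute, valid_range=(-40.0, 60.0)):
    print(issue)
```

A check like this only catches the symptoms that leave a trace in the records. A sensor that is miscalibrated but still reports plausible numbers will pass it, which is one more reason to treat the data as a junkyard until proven otherwise.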


Cassie Kozyrkov

Chief Decision Scientist, Google. ❤️ Stats, ML/AI, data, puns, art, theatre, decision science. All views are my own. twitter.com/quaesita