Making Data Useful

How Good Data Goes Bad

The data quality crisis no one is talking about

Cassie Kozyrkov
3 min read · Sep 26, 2023


A rule of thumb to save you tears in the long run is to assume every dataset is more like a hoarder’s storage locker than a well-curated museum until proven otherwise.

When in doubt, assume your data’s a junkyard.

But even if you’re not dealing with a dataset that’s a hoardsplosion of we-may-as-wells, there are two ways that fit-for-purpose data turns into garbage:

  1. Information loss during conversion
  2. Information selection issues

There are plenty more ways that bad data can happen to good intentions, but let’s talk about these two major ones for now.

Information loss during conversion

Data quality erodes whenever there’s a problem with the physical conversion of reality into electronic records. This is a relatively simple issue that manifests in numerous ways, from janky hard disks and broken equipment to real-world plans going awry: Were your sensors calibrated? Did your laptop run out of juice? Did the people you paid to enter survey information actually write down what they were supposed to? Did you wait too long while you relied on human memory for temporary storage? (Quick! How many hours did you sleep last night? Now tell me how many hours you slept two Mondays ago.)
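To make the conversion problem concrete, here is a minimal sketch of the kind of sanity check that can flag records where something went wrong between reality and the spreadsheet. Everything in it is an assumption for illustration: the function name audit_readings, the made-up temperature readings, and the plausible range of 30 to 45 are not from any real project.

```python
import math

def audit_readings(readings, low, high):
    """Count missing, out-of-range, and plausible values in a list of readings."""
    report = {"missing": 0, "out_of_range": 0, "ok": 0}
    for value in readings:
        # None or NaN: the reading never made it into the record.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            report["missing"] += 1
        # Outside the plausible range: possibly a miscalibrated sensor or a typo.
        elif not (low <= value <= high):
            report["out_of_range"] += 1
        else:
            report["ok"] += 1
    return report

# Hypothetical body-temperature readings in degrees Celsius.
sample = [36.6, None, 37.1, float("nan"), 120.0, 36.9]
report = audit_readings(sample, low=30.0, high=45.0)
print(report)
```

A check like this only catches the loud failures. A value that is wrong but plausible (say, a sleepy memory of two Mondays ago) sails straight through.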

Information selection issues

The other way your data museum becomes a junkyard is when you make bad choices about what to record and how to record it. Was your attribute of interest stored the right way? Did you lose information because you thought some attributes weren’t important? For example, did you forget to record the date and time you took each observation? Did you ask people to use a 5-point scale for what should have been a more precise numerical value? Did you ask all the relevant questions? And so on.
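The 5-point scale question is easy to see in a few lines of code. The sketch below assumes a hypothetical equal-width binning of hours of sleep (0 to 10) onto a 1 to 5 scale; the function to_five_point and the sample numbers are invented for illustration.

```python
def to_five_point(hours, low=0.0, high=10.0):
    """Map a numeric value onto a 1-5 scale using five equal-width bins."""
    clipped = min(max(hours, low), high)
    width = (high - low) / 5
    # Floor-divide into bins 0..4, shift to 1..5, and keep the top edge in bin 5.
    return min(int((clipped - low) // width) + 1, 5)

# Four different nights of sleep, recorded precisely.
precise = [6.1, 7.9, 6.0, 7.5]

# The same four nights, recorded on a 5-point scale.
coarse = [to_five_point(h) for h in precise]

print(len(set(precise)), "distinct values before binning")
print(len(set(coarse)), "distinct values after binning")
```

Under these assumptions, a night of 6.0 hours and a night of 7.9 hours land in the same bin, and nothing in the stored data can pull them apart again. You can always coarsen precise data later, but you can’t go the other way.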

Unfortunately, bad data design choices happen to good projects all the time. When you stop to think about it, it’s hardly surprising. After all, whose job is data quality anyway?



Cassie Kozyrkov

Chief Decision Scientist, Google. ❤️ Stats, ML/AI, data, puns, art, theatre, decision science. All views are my own. twitter.com/quaesita