Why Data Will Disappoint You

Data expectations haven’t caught up to data economics

Cassie Kozyrkov
5 min read · Sep 27

To better understand a dataset, I recommend asking yourself these two questions:

  • Dataset purpose: is it a museum or a storage locker?
  • Dataset provenance: did you design the collection or inherit the data?

I introduced these questions in an earlier article, so now I’d like to dwell on a point worth repeating: the economics of data storage (cheaper every day!) have ushered in a norm of data hoarding, but the zeitgeist hasn’t adjusted accordingly. We still tend to expect something clean, scientific, objective, and useful in every dataset. My guess is that this mindset is a carryover from the days when datasets were expensive to store and were thus often designed with care. For better or worse, those days are long gone.

A much better analogy for most modern datasets is a hoarder’s storage locker.

Image belongs to the author.

Data for solving specific problems

In this analogy, if you’re seeking to solve a very specific problem with a dataset, you have three options:

  • Build a museum
  • Buy a museum
  • Buy a hoarder’s storage locker and hope for the best

The third option is usually the most expensive of the lot, but the second option can be a surprisingly close second on the disappointment scale.

Cleaning up someone else’s mess

Solving a specific problem by digging through someone’s dusty jumble might seem inexpensive at the beginning of your project… but it’s a choice that comes back to haunt you. It’s usually a fabulous way to throw your time and/or money at getting nowhere. If you’re looking for your holy grail, it’ll take you more time to figure out that it’s missing if you have to dredge a convoluted pile of digital junk in search of it. And even if it is in there somewhere, the…

Cassie Kozyrkov

Chief Decision Scientist, Google. ❤️ Stats, ML/AI, data, puns, art, theatre, decision science. All views are my own. twitter.com/quaesita