Making Data Useful

Data: A Hoarder’s Storage Locker, Not a Magical Museum

Why data isn’t as useful as we think

Cassie Kozyrkov
3 min read · Aug 27, 2023

There’s a common misconception that data is the next best thing to a holy relic of science — objective, mathematical, clean, correct, and above all, always useful.

If you’re like most people, you envision data as a magical museum, meticulously organized and filled with diamonds and other gems, so brace yourself for a reality check!

A more accurate analogy for data would be a hoarder’s storage locker, filled to the brim with all kinds of stuff. If you’re willing to go spelunking into the mess you’ve inherited, you might find something valuable in there, but brace yourself for a pile of broken garbage that only its hoarder could love. (That is, if that hoarder even remembers what on earth they squirreled away in all that mess.)

Image is property of the author.

Data documentation is simultaneously an unsolved research problem — it’s not obvious how to design docs optimally for transparency and effective data sharing — and an annoying chore for the data hoarding enthusiast (since you’d keep fewer objects if you had to document them all at archival quality).

That’s why it’s hardly surprising that most datasets come with about as much documentation as your sink of dirty dishes. After all, the idea of storing as much as possible now in the hopes that it’s useful later is, by design, a way to punt the problem to a future somebody foolhardy enough to try to make a museum out of your rat’s nest. (Oh, the internal rage when the somebodies on both sides of that equation turn out to be you.)

To better understand data, I recommend asking yourself these two questions:

  • Dataset purpose: is it a museum or a storage locker?
  • Dataset provenance: did you design the collection or inherit the data?
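
One lightweight way to keep the answers from getting lost is to jot them down next to the dataset itself. Here is a minimal sketch in Python, assuming nothing more than the two questions above. The names `Purpose`, `Provenance`, `DatasetNote`, and the example dataset `clickstream_2019` are hypothetical, invented purely for illustration, and are not part of any library or of the author's own tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Purpose(Enum):
    """Hypothetical labels for the first question."""
    MUSEUM = "museum"                  # curated with a specific use in mind
    STORAGE_LOCKER = "storage locker"  # kept "just in case"


class Provenance(Enum):
    """Hypothetical labels for the second question."""
    DESIGNED = "designed"    # you planned the collection yourself
    INHERITED = "inherited"  # someone else (or past you) collected it


@dataclass(frozen=True)
class DatasetNote:
    """A two-field note to attach to a dataset (illustrative only)."""
    name: str
    purpose: Purpose
    provenance: Provenance

    def summary(self) -> str:
        # One-line description that a future reader can skim.
        return f"{self.name}: {self.purpose.value}, {self.provenance.value}"


# Example: an imaginary dataset that was stored "just in case" and handed down.
note = DatasetNote("clickstream_2019", Purpose.STORAGE_LOCKER, Provenance.INHERITED)
print(note.summary())
```

A note this small is nowhere near archival-quality documentation, but it does record which kind of mess a future somebody is about to walk into.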

Cassie Kozyrkov

Chief Decision Scientist, Google. ❤️ Stats, ML/AI, data, puns, art, theatre, decision science. All views are my own. twitter.com/quaesita