Why Data Will Disappoint You
To better understand a dataset, I recommend asking yourself these two questions:
- Dataset purpose: is it a museum or a storage locker?
- Dataset provenance: did you design the collection or inherit the data?
I introduced these questions in an earlier article, so now it’s time to dwell on a point worth repeating: the economics of data storage (cheaper every day!) have ushered in a norm of data hoarding, but the zeitgeist hasn’t adjusted accordingly. We still tend to expect something clean, scientific, objective, and useful in every dataset. My guess is that this mindset is a carryover from the days when datasets were expensive to store and were thus often designed with care. For better or worse, those days are long gone.
A much better analogy for most modern datasets is that of a hoarder’s storage locker.
Data for solving specific problems
In this analogy, if you’re seeking to solve a very specific problem with a dataset, you have three options:
- Build a museum
- Buy a museum
- Buy a hoarder’s storage locker and hope for the best
The third option usually ends up the most expensive of the lot, but the second option can be a surprisingly close second.
Cleaning up someone else’s mess
Solving a specific problem by digging through someone’s dusty jumble might seem inexpensive at the beginning of your project… but it’s a choice that comes back to haunt you. It’s usually a fabulous way to throw your time and/or money at getting nowhere. If you’re looking for your holy grail, it’ll take you more time to figure out that it’s missing if you have to dredge a convoluted pile of digital junk in search of it. And even if it is in there somewhere, the…