Back-to-basics on data science fundamentals

Test yourself! How many of these core statistical concepts are you able to explain?

CLT, CDF, Distribution, Estimate, Expected Value, Histogram, Kurtosis, MAD, Mean, Median, MGF, Mode, Moment, Parameter, Probability, PDF, Random Variable, Random Variate, Skewness, Standard Deviation, Tails, Variance

Got some gaps in your knowledge? Read on!

Note: If you see an unfamiliar term below, follow the link for an explanation.

Random variable

A random variable (R.V.) is a mathematical function that turns reality into numbers. Think of it as a rule to decide what number you should record in your dataset after a real-world event happens.
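To make the "rule" idea concrete, here is a minimal sketch in Python. The coin flip, the function names (`flip_coin`, `coin_to_number`), and the choice of recording heads as 1 and tails as 0 are all assumptions made for illustration; they are not taken from the article.

```python
import random


def flip_coin(rng: random.Random) -> str:
    """Stand-in for a real-world event: a coin lands heads or tails."""
    return rng.choice(["heads", "tails"])


def coin_to_number(outcome: str) -> int:
    """Hypothetical random variable: the rule that turns an outcome into a number."""
    mapping = {"heads": 1, "tails": 0}
    return mapping[outcome]


rng = random.Random(42)  # seeded so the run is repeatable

# Reality happens first (events), then the rule decides what gets recorded (dataset).
events = [flip_coin(rng) for _ in range(10)]
dataset = [coin_to_number(event) for event in events]

print(events)
print(dataset)
```

The randomness lives in the event, not in the rule: `coin_to_number` is an ordinary deterministic function, and a different rule (say, heads as 100 and tails as -100) would be a different random variable built on the same coin.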

A random variable is…

Getting Started

Know your species of machine learning task

The coarsest way to, ahem, classify supervised machine learning (ML) tasks is into classification versus prediction. (What’s supervised ML? See the video below if you need a refresher.)

Before we dive deeper into supervised learning, in this video I give you a quick refresher on how that differs from unsupervised learning.

Let’s start by making sure we’re all on the same page with the basic basics.

Basics: Algorithm vs Model

If you’re new to these terms, I recommend reading this. For the too-busy folk among you, here comes the briefest of reminders:

The point of ML/AI is to automate tasks by turning data (examples)…

If you’re the kind of person who likes to keep a tidy mind, here’s why your lip might curl in disgust when confronted with the title question in my recent article on the difference between classification, regression, and prediction: There is no classification.

There is no classification.

Let me explain.

There is no classification… and regression is something else entirely. Meme template from The Matrix.

Discrete versus continuous

Back when dinosaurs roamed the earth, it was fashionable to kick a statistics textbook off with a first chapter on the basics of data. To make sure that students had something to memorize for their first test, opening chapters usually featured this jargon:

  • Continuous data (measured, not counted), e.g. 173.5…

Continuous, discrete, categorical, cardinal, sequential… keep going!

Close your eyes and try to name as many data types as you can. Got them? Now let’s play bingo! (Look for the bolded words.)

Data types in statistics and analytics

Back when dinosaurs roamed the earth, it was fashionable to kick a statistics textbook off with a first chapter on the basics of data.

Don’t worry, the data taxonomy isn’t this complicated. Some of these critters look wonderfully derpy. Image: SOURCE.

To make sure that students had something to memorize for their first test, opening chapters usually included some jargon for different kinds of data:

  • Continuous data (measured, not counted), e.g. 176.5 cm (my height), 12% (free space on my phone), 3.141592… (pi), -40.00 (where Celsius meets Fahrenheit), etc.
  • Discrete data (counted…

Why it’s important to hire data engineers early

“What challenges are you tackling at the moment?” I asked. “Well,” the ex-academic said, “It looks like I’ve been hired as Chief Data Scientist… at a company that has no data.”

“Human, the bowl is empty.” — Data Scientist. Image: SOURCE.

I don’t know whether to laugh or to cry. You’d think it would be obvious, but data science doesn’t make any sense without data. Alas, this is not an isolated incident.

Data science doesn’t make any sense without data.

So, let me go ahead and say what so many ambitious data scientists (and their would-be employers) really seem to need to hear.

What is data engineering?

If data science is the discipline of…

Tips for identifying fakers and neutralizing their snake oil

You might have heard of analysts, ML/AI engineers, and statisticians, but have you heard of their overpaid cousin? Meet the data charlatan!

Attracted by the lure of lucrative jobs, these hucksters give legitimate data professionals a bad name.

[In a hurry? Scroll down for a quick summary at the bottom.]

Data charlatans are everywhere

Chances are that your organization has been harboring these fakers for years, but the good news is that they’re easy to identify if you know what to look for.

Data charlatans are so good at hiding in plain sight that you might be one without even realizing it. Uh-oh!

In a nutshell, it’s all about loneliness

The curse of dimensionality! What on earth is that? Besides being a prime example of shock-and-awe names in machine learning jargon (which often sound far fancier than they are), it’s a reference to the effect that adding more features has on your dataset. In a nutshell, the curse of dimensionality is all about loneliness.

In a nutshell, the curse of dimensionality is all about loneliness.
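For the impatient, here is a rough sketch of one common reading of "loneliness": as you add features, each data point ends up farther from its nearest neighbor. That reading, the uniformly random data, the 200-point sample, and the function name `mean_nearest_neighbor_distance` are all assumptions made for illustration.

```python
import math
import random


def mean_nearest_neighbor_distance(n_points: int, n_dims: int, rng: random.Random) -> float:
    """Average Euclidean distance from each point to its closest other point."""
    # Each point gets n_dims features drawn uniformly from [0, 1).
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n_points


rng = random.Random(0)  # seeded so the run is repeatable

# Same number of points every time; only the number of features changes.
results = {d: mean_nearest_neighbor_distance(200, d, rng) for d in (1, 2, 10, 100)}

for n_dims, distance in results.items():
    print(f"{n_dims:>3} features: mean distance to nearest neighbor = {distance:.3f}")
```

With the dataset size held fixed, the nearest neighbor drifts farther away as the feature count grows, so every point gets lonelier.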

Before I explain myself, let’s get some basic jargon out of the way. What’s a feature? It’s the machine learning word for what other disciplines might call a predictor / (independent) variable / attribute /…

Renaming that pesky little number and relearning how to use it

Is p for probability?

Technically, p-value stands for probability value, but since all of statistics is about probabilistic decision-making, that’s probably the least useful name we could give it.

Instead, here are some more colorful candidate names for your amusement.

On the nature of analytics, part 2 of 2

Before we dissect the nature of analytical excellence, let’s start with a quick summary of three common misconceptions about analytics from Part 1:

  1. Analytics is statistics. (No.)
  2. Analytics is data journalism / marketing / storytelling. (No.)
  3. Analytics is decision-making. (No!)

Misconception #1: Analytics versus statistics

While the tools and equations they use are similar, analysts and statisticians are trained to do very different jobs:

  • Analytics helps you form hypotheses, improving the quality of your questions.
  • Statistics helps you test hypotheses, improving the quality of your answers.

If you’d like to learn more about these professions, check out my article Can analysts and statisticians get along?

Misconception #2: Analytics versus journalism/marketing

A look inside one of the most powerful tools of the tech trade

In a nutshell: A/B testing is all about studying causality by creating believable clones — two identical items (or, more typically, two statistically identical groups) — and then seeing the effects of treating them differently.

When I say two identical items, I mean even more identical than this. The key is to find “believable clones” … or let randomization plus large sample sizes create them for you. Image: SOURCE.
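Here is a small sketch of how randomization plus sample size can manufacture those believable clones. The simulated trait (an age in years), the group sizes, and the helper names `randomize_into_groups` and `balance_gap` are hypothetical, chosen only for illustration.

```python
import random
import statistics


def randomize_into_groups(units, rng):
    """Shuffle the units and split them into two equally sized groups."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def balance_gap(n_units: int, rng: random.Random) -> float:
    """Difference between the two groups' average pre-existing trait."""
    # Hypothetical trait that exists before any treatment is applied.
    ages = [rng.gauss(35, 10) for _ in range(n_units)]
    group_a, group_b = randomize_into_groups(ages, rng)
    return abs(statistics.mean(group_a) - statistics.mean(group_b))


rng = random.Random(7)  # seeded so the run is repeatable

for n_units in (20, 2_000, 200_000):
    gap = balance_gap(n_units, rng)
    print(f"{n_units:>7} units: gap in average age between groups = {gap:.3f}")
```

Nobody matched the groups by hand. Random assignment tends to even out the trait (along with every other trait, measured or not), and larger samples typically shrink the gap further. That is what makes it reasonable to treat the two groups as statistically identical before you treat them differently.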

Scientific, controlled experiments are incredible tools; they give you permission to talk about what causes what. Without them, all you have is correlation, which is often unhelpful for decision-making.

Experiments are your license to use the word “because” in polite conversation.

Unfortunately, it’s fairly common to see folks deluding themselves about the quality of their inferences, claiming the benefits of scientific experimentation without having done…

Cassie Kozyrkov

Head of Decision Intelligence, Google. ❤️ Stats, ML/AI, data, puns, art, theatre, decision science. All views are my own.
