*The purpose of this article is to give data scientists and interested nerds some intuition about what’s behind the T-test and p-value. Statisticians, please don’t get too picky on me here.*

## Introduction

As data scientists we try to change the world by finding game-changing insights, predicting the future with hardcore machine learning algorithms and answering sensitive questions with high certainty.

All these things we accomplish by transforming, aggregating, wrangling & modeling **data**. Indeed, we use the data provided to find correlations, claim relations or even causations. These of course we validate against more (unseen) data, to improve our claims and models. The data we use in this process acts as *evidence* to support our findings.

The findings and models we formulate are called **hypotheses**. Hypotheses are meant to have explanatory power but do not have enough evidence built up to become a **scientific theory**. As we have heard a thousand times: correlation does not imply causation. So how can we test our hypotheses?

In this blog post we’ll make up an example, start introducing statistical concepts one-by-one and leave you with a warm feeling by the end of the article.

## Hypothesis Testing

Say we want to see if smoking while driving has an impact on *how long* people drive from point A to point B.

To see if this **effect (change in driving time while smoking)** takes place we would measure how long it takes for people to go from A to B while smoking, and without smoking.

This would define 2 groups within a population.

– Group 1: Drivers driving without smoking

– Group 2: Drivers driving while smoking

In science, an **effect** does not take place unless shown otherwise. This is the starting position and assumption made for every hypothesis test in science. We call it the **Null Hypothesis: \(H_0\)**. The Null Hypothesis is expressed as the *means* between 2 groups of interest (\(\mu_1 ,\mu_2\)) in a given situation being equal. In symbols the previous would be written as follows:

$$

{H_0: \mu_1 = \mu_2}

$$

Now here comes the sexy part: if you can show that the means are **not equal**, you can reject \(H_0\) in favor of the **Alternative Hypothesis: \(H_1\)**. In symbols:

$$

{H_1: \mu_1 \ne \mu_2 }

$$

Rejecting \(H_0\) and accepting \(H_1\) means that the effect you expect takes place. In this example: if we measure a different mean for our groups of interest, smoking indeed *has* an impact on the driving time between points A and B.

## Student’s T-Test

There are a few issues we didn’t consider yet when talking about rejecting the null hypothesis.

– We can’t invite all smoking and non-smoking drivers to participate in our measurements. We’ll have to work with a sample of the population.

– The difference in means might have resulted from sheer chance instead of from some real effect.

To address these problems, statisticians have created the **T-test** (actually, the creator is William Sealy Gosset, who published under the pseudonym “Student”). In the T-test you **calculate a t-value** which tells you *how strong* the effect you see is. You then compare this calculated t-value to a **critical t-value** to determine if you can reject \(H_0\).

If \(t_{Calculated} > t_{Critical}\) you can reject \(H_0\) and accept \(H_1\).

If \(t_{Calculated} \leq t_{Critical}\) you fail to reject \(H_0\).

Since we work with independent means between 2 groups, we will make use of the **Independent T-Test**. The t-value is calculated as follows:

$$

t_{Calculated} = \frac

{\overline{X}_1-\overline{X}_2}

{

\sqrt{ \frac{\sigma^2_1}{N_1} + \frac{\sigma^2_2}{N_2} }

}

$$

Where \(\overline{X}\) is the mean, \(\sigma^2\) is the variance (the squared standard deviation) and \(N\) is the sample size of groups 1 and 2. The formula can be intuitively interpreted as the *‘signal’ divided by the ‘noise’ level*.

$$

t_{Calculated} = \text{Strength Of Effect} = \frac{Signal}{Noise}

$$

A few interesting things can be derived from this formula:

– \(t_{Calculated}\) is stronger if the difference between the means of the 2 interest groups is bigger.

– Larger variances (\(\sigma^2_1,\sigma^2_2\)) result in more \(Noise\). More \(Noise\) means a weaker t-value.

– More samples (\(N_1,N_2\)) mean more *certainty* in our measurement, thus leaving us with less \(Noise\). Less \(Noise\) results in a stronger t-value.
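The formula above can be sanity-checked in a few lines of Python. This is a minimal sketch using the standard library’s `statistics` module, with the sample variance standing in for \(\sigma^2\); the driving times are hypothetical numbers made up for illustration:

```python
import math
import statistics

def t_value(group1, group2):
    """Independent t-value: difference of means (signal) over standard error (noise)."""
    signal = statistics.mean(group1) - statistics.mean(group2)
    noise = math.sqrt(statistics.variance(group1) / len(group1)
                      + statistics.variance(group2) / len(group2))
    return signal / noise

# Hypothetical driving times (minutes) for two small groups
no_smoking = [88, 91, 89, 92, 90]
smoking = [99, 101, 100, 98, 102]
print(t_value(no_smoking, smoking))  # -10.0: non-smokers drive faster on average
```

The sign only reflects which group you put first; the magnitude is what carries the strength of the effect.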

So what does \(t_{Critical}\) look like and what is it? To understand \(t_{Critical}\) we need to understand what we actually aim for. It could be that the effect you see in the data is simply due to random chance.

**The probability of observing an effect at least as strong as the one you measured, while in reality there is no effect (\(H_0\) is true), is what we call the p-value.**

Obviously we want the p-value to be small, smaller in fact than a given significance level (\(\alpha\)). By default, \(\alpha\) is 0.05. Let’s visualize this.

The blue area on the figure would represent occurrences in the dataset that are considered not expected whereas the green area signifies occurrences that are considered expected. The p-value basically signifies unexpected occurrences going outside the green zone, **possibly behaving like normal occurrences in a different interest group.**

The next value we need for \(t_{Critical}\) is the degrees of freedom (\(df\)). For an independent T-test the degrees of freedom can be calculated as:

$$

df = N_1 + N_2 - 2

$$

Intuitively, see the degrees of freedom as how many samples there are that can be *freely* picked. Why minus 2? Because once you’re left with the last sample of a group, you have no more *choice* or *freedom* but to pick that one. Since we have 2 groups: minus 2.

Statisticians have computed \(t_{Critical}\) for a large range of \(df\)s and significance levels, which can be found in a neat table online or by using software of your liking. Based on the \(t_{Critical}\) you’ve found, you can now compare it to \(t_{Calculated}\) to reject or fail to reject \(H_0\).
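Instead of a printed table, the lookup can be done in software. A sketch assuming SciPy is available, using `scipy.stats.t` (`ppf` is the inverse CDF for the critical value, `sf` the survival function for the p-value):

```python
from scipy.stats import t

alpha = 0.05
n1, n2 = 20, 20
df = n1 + n2 - 2  # 38 degrees of freedom

# Two-tailed test: split alpha over both tails of the t-distribution
t_critical = t.ppf(1 - alpha / 2, df)
print(t_critical)  # roughly 2.02 for df = 38

# The p-value for some observed t-value, e.g. 14.14 from the example below
t_calculated = 14.14
p_value = 2 * t.sf(abs(t_calculated), df)
print(p_value < alpha)  # True: reject H0
```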

## More intuition to the t-value

To illustrate what the t-value signifies, check out the next graph:

Sticking to the example given before, the distributions of the 2 groups were plotted by assuming the following values for \(\mu\) and \(\sigma^2\).

- Group A (non-smokers): \(\mu=90\) and \(\sigma^2=5\)
- Group B (smokers): \(\mu=100\) and \(\sigma^2=5\)

These numbers mean that drivers who don’t smoke took on average 90 minutes to get from point A to point B. Drivers who did smoke during the trip took on average a whopping 10 minutes longer (this is of course ridiculous, but let’s keep it that way for the sake of this example).

Given that there were 20 people driving in every group, we can calculate \(t_{Calculated}=14.14\).

$$

t_{Calculated} = \frac

{100-90}

{

\sqrt{ \frac{5}{20} + \frac{5}{20} }

} = 14.14

$$
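This arithmetic is easy to verify with a quick check in Python, plugging the example’s numbers straight into the formula:

```python
import math

mean_smokers, mean_non_smokers = 100, 90  # average minutes from A to B
variance, n = 5, 20                       # sigma^2 = 5, 20 drivers per group

t_calculated = (mean_smokers - mean_non_smokers) / math.sqrt(variance / n + variance / n)
print(round(t_calculated, 2))  # 14.14
```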

Take notice of the overlapping blue areas of both groups. Also notice the **red areas**, or **p-value zones**. It should be clear that a person from group A (non-smokers) would only give a result as if he were from group B (e.g. driving ±100 minutes) **if by some sheer bad luck he encountered an unlucky event on the road**. You can see this on the graph by looking at both means \(\mu_1,\mu_2\) falling in the p-value zone of the other group. Remember, the p-value zones signify zones where measurements are **unlikely**.

What will happen if we’d increase the \(\sigma^2\)? Let’s take a look if we increase it to \(\sigma^2=10\).

When \(\sigma^2=10\) and \(N=20\) for both groups, \(t_{Calculated}=10.00\).

$$

t_{Calculated} = \frac

{100-90}

{

\sqrt{ \frac{10}{20} + \frac{10}{20} }

} = 10.00

$$

The \(t_{Calculated}\) is clearly smaller than the previous one. This means that **the effect is less strong than when \(\sigma^2=5\)**.
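Redoing the computation with the larger variance shows the drop directly (same formula as before, expressed as a small helper):

```python
import math

def t_from_stats(mean1, mean2, var1, var2, n1, n2):
    """Independent t-value from group means, variances and sample sizes."""
    return (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)

print(t_from_stats(100, 90, 5, 5, 20, 20))    # ~14.14 with sigma^2 = 5
print(t_from_stats(100, 90, 10, 10, 20, 20))  # 10.0 with sigma^2 = 10
```

Same signal, more noise: the t-value shrinks.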

Notice how the blue zones overlap each other way more now. Recall that the blue zone is what we expect to happen in the measurements. If the blue zones overlap each other’s means (\(\mu_1,\mu_2\)), it is pretty much *expected* that some drivers from group 1 will produce a driving time as big as the \(\mu_2\) of group 2.

Interpret these findings as follows:

– A smaller \(t_{Calculated}\) signifies **less statistical significance**.

– A larger \(t_{Calculated}\) signifies **more statistical significance**.

This is why \(t_{Calculated}\) gets to be compared to \(t_{Critical}\), and if it’s larger, you can reject the sceptic’s \(H_0\) and keep your claim that smoking makes for a longer driving time.

## More info for technical correctness

The following assumptions are made when working with the independent student’s T-test:

– Observations are randomly sampled

– Every sample is an independent observation

– The data is normally distributed

– There is a homogeneity of variance

– The t-distribution is symmetric around 0, so take the `abs` of the t-value when comparing it to \(t_{Critical}\).
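Putting everything together, a minimal end-to-end sketch of the whole workflow. The driving times are simulated (hypothetical data matching the example’s \(\mu\) and \(\sigma^2\)), and the critical value 2.024 for \(df = 38\) and \(\alpha = 0.05\) is taken from a standard two-tailed t-table:

```python
import math
import random
import statistics

random.seed(42)

# Simulate 20 driving times (minutes) per group, normally distributed
non_smokers = [random.gauss(90, math.sqrt(5)) for _ in range(20)]
smokers = [random.gauss(100, math.sqrt(5)) for _ in range(20)]

signal = statistics.mean(smokers) - statistics.mean(non_smokers)
noise = math.sqrt(statistics.variance(smokers) / 20
                  + statistics.variance(non_smokers) / 20)
t_calculated = signal / noise

T_CRITICAL = 2.024  # two-tailed, alpha = 0.05, df = 20 + 20 - 2 = 38

# The t-distribution is symmetric around 0, so compare |t| to the critical value
if abs(t_calculated) > T_CRITICAL:
    print("Reject H0: smoking affects driving time")
else:
    print("Failed to reject H0")
```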

## Conclusion

We are constantly bombarded by claims people make. Thanks to the scientific method we can test these claims, with statistics as our friend.

Next time you’re in a discussion with someone, have a cheeky smile and ask them “bro, what’s your p-value though?”. And of course what you actually mean is “How certain are you that what you claim did not happen by sheer chance?”.

It’s common for both statisticians and data scientists to use statistical software, check if their model has \(p < 0.05\) and move on. This article was meant to give you some intuition behind that number on the screen, and I sincerely hope it did.

Until next time!
