Statistical Significance: The Basics

Richard Blissett | Last Updated 2017-10-17


The purpose of this document is to provide a broad overview of the concept of statistical significance. This overview is not meant to provide a deep understanding of statistical significance; I recommend that readers take an introductory statistics course or consult other materials for that kind of depth.

It is important to note that the approach to statistical significance as described in this overview relies on a set of philosophies particular to frequentist statistics. There is another paradigm, called Bayesian statistics, that does not subscribe to this kind of thinking. As of right now, however, the frequentist approach is the dominant approach in policy research.

Framing the problem

First of all, what is statistical significance about? What problem are people trying to solve when they talk about statistical significance?

The problem posed by any test of statistical significance is this: Suppose we observe some pattern in the data that is in front of us. Should we infer anything from it regarding what we think the “true” value is in the population?

Here’s a concrete example: Suppose you administered a math test to 60 people. About half of the people had blond hair. The rest of the people had brunette hair. When you scored the tests, you found that the average score for blond people was ten points higher than the average score for brunette people.

In the world of descriptive statistics, we would stop here and simply report that in this sample of 60 people, blond people scored higher than brunette people, on average. This is not a controversial statement, in this scenario. This is just presenting the facts of the data.

However, the job of statisticians is often to go one step farther. From this analysis, can we conclude that blond people, in general, score higher on this test of math than brunette people? Can we infer that outside of this sample of 60 people, this pattern we find would repeat itself elsewhere, in the population as a whole?

This is the essential problem for statistical significance.

The logic of statistical significance

There is something important to keep in mind as we review some common tests for statistical significance.

  • All significance testing works the same way.
  • All significance testing works the same way.
  • All significance testing works the same way.

Here is my point in blue.

  • All significance testing works the same way.

Here it is in red.

  • All significance testing works the same way.

And here it is in visual form as communicated by my dog, Lilo.

[Image: Lilo, the author’s dog]

Often, tests of significance are taught on an as-relevant basis. While this makes sense for helping people understand each test in its own context, I find that it often obscures the fact that all tests of statistical significance operate under the same tenets. While the formulas and distributions that they use are different, at their core, the logic for each of them is identical.

But what is the logic? The basic idea is that any pattern that we observe in our data could have looked different if we had drawn a different sample. So, using the example from before, if we had drawn a different sample of blonds and brunettes, we might have found that blonds only scored five points higher. In yet another sample, we might have found that blonds scored ten points below brunettes.

The lesson here is that across randomly drawn samples, the size of the difference we would have seen would vary just on the basis of chance. This world is a wonderful, random place with (almost) endless possibilities, so it is very likely that we would see a variety of data patterns across different randomly drawn samples because of random chance alone. In order for us to conclude anything generally about the difference between blonds and brunettes based on our data, we have to make the claim that there is a low probability that the pattern we found was just a fluke. In other words, we think there is some systematic difference between the groups that makes it such that blonds score higher than brunettes.

No matter what the test of significance is, the basic logical steps are the same:

  1. You calculate some measure of the size of the pattern (called the “test statistic”) in your data.
  2. You compare that size to the range/distribution of the sizes that it could have been in different samples if the “true” size of the pattern was zero.
  3. You determine the probability of seeing a pattern at least that far from zero, based on that hypothetical distribution.
  4. If it’s a low probability, you conclude that you think that the “true” size is, in fact, not zero. If it’s a high probability, you conclude that based on your data, you can’t conclude with confidence that it wasn’t just a fluke.

What distinguishes all of the statistical tests from each other are two mechanics: (a) how we calculate the size of the pattern and (b) the distribution of sizes that we compare it to. Determining which statistical test we use depends on what kind of pattern we are observing.
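The four steps above can be sketched as a small simulation. This is only an illustration under made-up assumptions: the group sizes (30 and 30) and the observed gap of 10 points echo the hair-color example, but the population mean of 70 and standard deviation of 15 are hypothetical values I picked so that the code has something to draw from.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical setup: 30 blonds, 30 brunettes, observed gap of 10 points.
# The population mean (70) and standard deviation (15) are assumptions.
N_PER_GROUP = 30
OBSERVED_DIFF = 10.0
POP_MEAN, POP_SD = 70.0, 15.0
N_SIMULATIONS = 5000

def simulated_difference():
    """Draw both groups from the SAME population, so the true difference is zero."""
    blonds = [random.gauss(POP_MEAN, POP_SD) for _ in range(N_PER_GROUP)]
    brunettes = [random.gauss(POP_MEAN, POP_SD) for _ in range(N_PER_GROUP)]
    return sum(blonds) / N_PER_GROUP - sum(brunettes) / N_PER_GROUP

# Step 2: differences we could have seen if the "true" difference were zero.
null_differences = [simulated_difference() for _ in range(N_SIMULATIONS)]

# Step 3: how often is a chance difference at least as far from zero as ours?
n_extreme = sum(abs(d) >= abs(OBSERVED_DIFF) for d in null_differences)
p_value = n_extreme / N_SIMULATIONS

# Step 4: compare that probability against a threshold.
print(f"Estimated probability: {p_value:.4f}")
print("Reject the null" if p_value < 0.05 else "Fail to reject the null")
```

The named tests later in this document replace the simulation with a known theoretical distribution, but the comparison being made is the same.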

While this is the logic, there is a formal process and set of terms that accompany this logic. What I write below is exactly the same as what I have written above, but using formal language.

  1. We establish a “null hypothesis” (\(H_0\)) that we should not be finding a pattern (that the pattern size is zero).
  2. We establish an “alternative hypothesis” (\(H_A\)) that the pattern we find is real.
  3. We calculate a test statistic that measures the size of the pattern.
  4. We compare that test statistic to what we think the distribution of the test statistic would have looked like over repeated samples if the “true” pattern size was zero.
  5. If the probability of obtaining a test statistic that extreme is below a certain threshold, we conclude that we are finding a real pattern, and we “reject” the null hypothesis. Typically, we consider a probability of below 10% or 5% to be sufficient evidence that we have found some real pattern in our data. We will return to this point later.
  6. If the probability of obtaining a test statistic that extreme is above that threshold, we conclude that we have not found evidence of a real pattern that isn’t zero, and thus we “fail to reject” the null hypothesis.

Why this wonky “reject” and “fail to reject” language, instead of “accept the alternative hypothesis”? It may seem nitpicky, but the difference is important. The series of steps we established above only tests whether the pattern is different from zero, NOT whether the size of the pattern we found is itself right. For all we know, the real pattern could be even bigger. Or alternatively, it could be smaller, but still different from zero. Likewise, failing to reject the null hypothesis does not show that the pattern is zero; it only means that our data did not give us enough evidence to rule zero out. Tests of statistical significance do not test if the value of the pattern we find is “correct.” They only tell us if the pattern is, by our standards of evidence, not zero.

Applying the logic

Let us say that we took a random sample of 60 blond and brunette people. The data from this sample is shown below, with the histogram for blonds in red and the histogram for brunettes in blue.

[Figure: histograms of math scores for blonds (red) and brunettes (blue)]

In these data, we have 33 blond people and 27 brunette people. The averages for the two groups are 65.31 and 72.02, respectively. The standard deviations are 11.28 and 6.57, respectively. The difference, then, between the two groups is -6.71. According to all of what we have said above, the point of significance testing is to make some decision, based on this sample data, about whether we think this pattern exists in the population. In other words, is there a true difference in the population in math scores between blonds and brunettes?

Steps 1 and 2: The hypotheses

Alright, so our null hypothesis is that there is not a real difference in the population, and our alternative hypothesis is that there is.

\(H_0\): There is not a difference between mean scores of blonds and brunettes in the population.
\(H_A\): There is a difference between blonds and brunettes in the population.

Using more formal mathematical notation, we might write this as the following.

\(H_0: \mu_{blond} = \mu_{brunette}\)
\(H_A: \mu_{blond} \neq \mu_{brunette}\)

Wait, but what about that big “zero” thing I was making a big deal about before? It’s here, just not explicitly. The “pattern” we are investigating is the difference between two means. The hypothesis we are testing is that there is no difference. Note that the following rewrite of the hypotheses implies exactly the same thing as the ones I wrote above.

\(H_0: \mu_{blond} - \mu_{brunette} = 0\)
\(H_A: \mu_{blond} - \mu_{brunette} \neq 0\)

Step 3: The test statistic

We calculate some measure of the “size” of the pattern. In this case, this measure is called a “t-statistic.” I cover the calculation of the t-statistic in the next section, but know that the value of the t-statistic here is -2.87.

Steps 4-6: Compare to hypothetical distribution

Based on what we know about how a t-statistic works, we know what the distribution of a t-statistic would look like if the null hypothesis were true. It would look something like this.

[Figure: distribution of the t-statistic if the null hypothesis were true]

Based on a threshold of 0.05, we will reject the null hypothesis if the chance of getting a t-statistic at least as extreme as -2.87 is less than 0.05. In other words, we will reject the null hypothesis if the t-statistic lies anywhere within the shaded region below.

[Figure: the same distribution of the t-statistic, with the rejection region shaded]

Lo and behold, looks like our t-statistic is far out there. Thus, we reject the null hypothesis that the means of the two groups are equal. Formally, we can calculate that the p-value is 0.0059, which is less than 0.05.
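For readers who want to see where a number like 0.0059 can come from, here is a rough sketch that integrates the Student’s t density numerically. It assumes the unequal-variance version of the two-sample test, with degrees of freedom from the Welch-Satterthwaite approximation computed from the summary statistics reported above; that choice, and the helper names `t_density` and `two_sided_p_value`, are my assumptions and not something the text specifies. In practice you would let statistical software do this.

```python
import math

# Summary statistics reported in the text.
n1, s1 = 33, 11.28   # blonds
n2, s2 = 27, 6.57    # brunettes
t_stat = -2.87       # t-statistic reported in the text

# Assumption: Welch-Satterthwaite degrees of freedom (unequal variances).
v1, v2 = s1**2 / n1, s2**2 / n2
df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

def t_density(x, df):
    """Density of the Student's t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p_value(t, df, steps=2000):
    """P(|T| >= |t|), integrating the density over [0, |t|] with Simpson's rule."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_area = 2 * (total * h / 3)  # probability that |T| < |t|
    return 1 - central_area

p_value = two_sided_p_value(t_stat, df)
print(f"df = {df:.2f}, p = {p_value:.4f}")
```

Because the inputs are rounded summary statistics, the result should land close to the reported p-value without necessarily matching every digit.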

Common tests of statistical significance

Here, I cover several common tests of statistical significance. Remember: they all work exactly the same. Don’t get caught up in the formulas and the distributions. Just know they all work the same. Note that for the z-tests and t-tests, there are one-sided and two-sided alternatives, but given the relative prevalence of two-sided tests, I will only cover those here.

One-sample z-test

Purpose: Determining if the average of your sample data (\(\overline{x}\)) is different from a population average (\(\mu\)). Use this test if you know the standard deviation of the population (\(\sigma\)) or if the sample size (\(n\)) is over 30.

\(H_0\): \(\overline{x} = \mu\)
\(H_A\): \(\overline{x} \neq \mu\)

Formula:

\[z = \frac{\overline{x}-\mu}{SE} = \frac{\overline{x}-\mu}{\frac{\sigma}{\sqrt{n}}}\]

Wait, but didn’t we just say that this is supposed to be testing if our “pattern” is different from 0? I don’t see 0 anywhere up there! Calm down. All of the above can be rewritten to explicitly include the 0 – we often just don’t include the 0 because it’s not useful.

\(H_0\): \(\overline{x} - \mu = 0\)
\(H_A\): \(\overline{x} - \mu \neq 0\)

Formula:

\[z = \frac{(\overline{x}-\mu)-0}{SE} = \frac{(\overline{x}-\mu)-0}{\frac{\sigma}{\sqrt{n}}}\]

Notice that this is the same equation as the other one above. We are testing if the difference between the sample mean and the population mean is significantly different from 0. This is exactly the same as saying that we are testing if the sample mean is significantly different from the population mean.

Once we calculate the z-statistic, we then compare it to what we think the distribution of the z-statistic across repeated samples could have looked like if the null hypothesis was true. Fortunately, we know this in the statistical world. The distribution of the z-statistic would have been the normal distribution with a mean of 0 and a standard deviation of 1, as shown below.

[Figure: normal distribution with a mean of 0 and a standard deviation of 1]

So for any z-statistic, we know the probability of getting a value that extreme or more. This is called the “p-value.” You can typically obtain the p-value from a specific z-value by using some sort of statistical software or comparing the z-value to a table of p-values (this can be found in several places online). (Formally, this is calculated using an integral, but you don’t have to know that here.)
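As a minimal sketch of the whole z-test, the snippet below computes the z-statistic and its two-sided p-value using `NormalDist` from Python’s standard library. The sample mean, population mean, population standard deviation, and sample size are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
import math
from statistics import NormalDist

# Hypothetical numbers, chosen only for illustration.
sample_mean = 103.0      # x-bar
population_mean = 100.0  # mu
population_sd = 15.0     # sigma
n = 100

standard_error = population_sd / math.sqrt(n)
z = (sample_mean - population_mean) / standard_error

# Two-sided p-value: probability of a z at least this far from 0
# in either direction under the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, p = {p_value:.4f}")
```

With these inputs the standard error is 1.5, so the sample mean sits two standard errors above the population mean.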

One-sample t-test

Purpose: Determining if the average of your sample data (\(\overline{x}\)) is different from a population average (\(\mu\)). Use this test if you don’t know the standard deviation of the population and instead only know the sample standard deviation (\(s_x\)), and/or if the sample size (\(n\)) is below 30.

\(H_0\): \(\overline{x} = \mu\)
\(H_A\): \(\overline{x} \neq \mu\)

Formula:

\[t = \frac{\overline{x}-\mu}{SE} = \frac{\overline{x}-\mu}{\frac{s_x}{\sqrt{n}}}\]

This is called a t-statistic because it relies on a different distribution. While the z-statistic can be mapped to a normal distribution, the t-statistic can be mapped to the Student’s t-distribution. The additional quirk of the Student’s t-distribution is that the specific shape (and thus, probabilities) is additionally dependent on something called the “degrees of freedom” (“df”). This is calculated as \(df = n - 1\).

[Figure: Student’s t-distribution]

Once the degrees of freedom goes beyond about 40, the Student’s t-distribution pretty much resembles the normal distribution, so you often see the z-test and the t-test being used interchangeably in large samples.
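Here is a small sketch of the one-sample t-statistic computed from raw data. The eight scores and the population average of 70 are hypothetical; the only difference from the z-test arithmetic is that the standard error uses the sample standard deviation.

```python
import math
import statistics

scores = [72, 68, 75, 71, 69, 74, 70, 73]  # hypothetical sample
population_mean = 70                        # mu (hypothetical)

n = len(scores)
sample_mean = statistics.mean(scores)
sample_sd = statistics.stdev(scores)  # sample standard deviation (divides by n - 1)

standard_error = sample_sd / math.sqrt(n)
t_stat = (sample_mean - population_mean) / standard_error
df = n - 1

print(f"t = {t_stat:.3f} with df = {df}")
```

The resulting t-statistic would then be compared to the Student’s t-distribution with 7 degrees of freedom.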

Independent two sample t-test

Purpose: Determining if the average of one sample (\(\overline{x}_1\)) is statistically different from the average of another sample (\(\overline{x}_2\)).

\(H_0\): \(\mu_1 = \mu_2\)
\(H_A\): \(\mu_1 \neq \mu_2\)

Notice that we write the null hypothesis using Greek letters to indicate that our hypotheses are about the population parameters.

Formula (if you assume that in the population, \(\sigma^2_1 = \sigma^2_2\)):

\[t = \frac{\overline{x}_1-\overline{x}_2}{SE} = \frac{\overline{x}_1-\overline{x}_2}{s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} = \frac{\overline{x}_1-\overline{x}_2}{\sqrt{\frac{(n_1 - 1)s^2_1 + (n_2 - 1)s^2_2}{n_1 + n_2 - 2}}\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}\]

Formula (if you assume that in the population, \(\sigma^2_1 \neq \sigma^2_2\)):

\[t = \frac{\overline{x}_1-\overline{x}_2}{SE} = \frac{\overline{x}_1-\overline{x}_2}{\sqrt{\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2}}}\]

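This is still called a t-statistic, even though the formula is different, because it can still be mapped to the Student’s t-distribution. Under the equal-variance formula, the degrees of freedom are calculated as \(df = n_1 + n_2 - 2\). Under the unequal-variance formula, the degrees of freedom are instead approximated with the Welch-Satterthwaite equation, \(df \approx \frac{(s^2_1/n_1 + s^2_2/n_2)^2}{\frac{(s^2_1/n_1)^2}{n_1 - 1} + \frac{(s^2_2/n_2)^2}{n_2 - 1}}\).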
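To make the two formulas concrete, the sketch below plugs in the summary statistics from the blond/brunette example earlier in this document. Working from those rounded values, the unequal-variance version appears to reproduce the t-statistic of -2.87 reported there, while the pooled version comes out slightly smaller in magnitude. The Welch-Satterthwaite degrees of freedom shown for the unequal-variance case are the standard approximation for that test.

```python
import math

# Summary statistics from the worked example in the text.
mean1, s1, n1 = 65.31, 11.28, 33   # blonds
mean2, s2, n2 = 72.02, 6.57, 27    # brunettes
diff = mean1 - mean2

# Equal-variance (pooled) version.
pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
t_pooled = diff / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))
df_pooled = n1 + n2 - 2

# Unequal-variance version.
v1, v2 = s1**2 / n1, s2**2 / n2
t_unequal = diff / math.sqrt(v1 + v2)
# Welch-Satterthwaite approximation to the degrees of freedom.
df_unequal = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(f"pooled:  t = {t_pooled:.2f}, df = {df_pooled}")
print(f"unequal: t = {t_unequal:.2f}, df = {df_unequal:.1f}")
```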

Dependent/paired sample t-test

Purpose: Determining if the average of one sample (\(\overline{x}_1\)) is statistically different from the average of another sample (\(\overline{x}_2\)), where the samples are paired. In other words, each unique observation in one sample is linked to a unique observation in the other sample (e.g., test scores at two time points for the same individuals).

\(H_0\): \(\mu_1 = \mu_2\)
\(H_A\): \(\mu_1 \neq \mu_2\)

Formula:

\[t = \frac{\overline{x}_1 - \overline{x}_2}{SE} = \frac{\overline{x}_1 - \overline{x}_2}{\frac{s_{x_1-x_2}}{\sqrt{n}}}\]

Degrees of freedom is \(df = n-1\).
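A quick sketch with hypothetical before/after scores for five people shows the key move in the paired test: compute one difference per pair, then treat those differences like a single sample.

```python
import math
import statistics

# Hypothetical test scores for the same five people at two time points.
before = [70, 65, 80, 75, 60]
after = [74, 66, 85, 77, 63]

# One difference per pair (here x_1 is "after" and x_2 is "before").
differences = [a - b for a, b in zip(after, before)]

n = len(differences)
mean_diff = statistics.mean(differences)   # equals mean(after) - mean(before)
sd_diff = statistics.stdev(differences)    # standard deviation of the differences

t_stat = mean_diff / (sd_diff / math.sqrt(n))
df = n - 1

print(f"t = {t_stat:.3f} with df = {df}")
```

Note that \(n\) here is the number of pairs, not the total number of scores.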

One-way ANOVA

Purpose: Determining if several (more than two) sample means are all the same across all \(k\) categories.

\(H_0\): \(\mu_1 = \mu_2 = \mu_3 = \dots = \mu_k\)
\(H_A\): At least one \(\mu_i\) is different.

Formula:

\[F = \frac{MSB}{MSW} = \frac{\frac{\sum_i{n_i(\overline{x}_i-\overline{x}_{total})^2}}{k-1}}{\frac{\sum_i\sum_j{(x_{ij}-\overline{x}_i)^2}}{n_{total}-k}}\]

This is known as an F-statistic, and it can be compared with the F distribution. Like the Student’s t-distribution, it depends on degrees of freedom. The F distribution depends on two degrees of freedom, the first being calculated as \(df_1 = k-1\) and the second one calculated as \(df_2 = n_{total}-k\).
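The F-statistic can be sketched from raw data as follows. The three groups are hypothetical numbers chosen so the arithmetic is easy to check by hand. The mean square between groups uses each group mean’s squared distance from the overall mean, and the mean square within groups uses each observation’s squared distance from its own group mean.

```python
# Hypothetical data: three groups of three observations each.
groups = [
    [1, 2, 3],
    [2, 3, 4],
    [5, 6, 7],
]

k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = sum(sum(g) for g in groups) / n_total
group_means = [sum(g) / len(g) for g in groups]

# Between-group sum of squares: group means versus the grand mean.
ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
# Within-group sum of squares: observations versus their own group mean.
ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_stat = ms_between / ms_within
df1, df2 = k - 1, n_total - k

print(f"F = {f_stat:.2f} with df1 = {df1}, df2 = {df2}")
```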

[Figure: F distribution]

Chi-square test of independence

Purpose: Determining if there is a significant relationship between two categorical variables.

\(H_0\): There is no association between the two variables.
\(H_A\): There is an association between the two variables.

Formula (after calculating the \(r\times c\) contingency table):

\[\chi^2 = \sum{\frac{(O_{rc}-E_{rc})^2}{E_{rc}}}= \sum{\frac{(n_{rc}-\frac{n_r\times n_c}{n})^2}{\frac{n_r\times n_c}{n}}}\]

The \(\chi^2\)-distribution, like the Student’s t-distribution, also relies on a single degrees-of-freedom parameter, calculated as \(df = (r-1)(c-1)\).
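As a rough sketch, the chi-square statistic can be computed from a contingency table like this. The 2 by 2 table of counts is hypothetical. Each expected count is the row total times the column total divided by the overall total, and each cell contributes its squared gap between observed and expected, divided by expected.

```python
# Hypothetical 2 x 2 contingency table of observed counts.
observed = [
    [10, 20],
    [20, 10],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for r, row in enumerate(observed):
    for c, count in enumerate(row):
        expected = row_totals[r] * col_totals[c] / n  # count expected under independence
        chi2 += (count - expected) ** 2 / expected

df = (len(row_totals) - 1) * (len(col_totals) - 1)

print(f"chi-square = {chi2:.3f} with df = {df}")
```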

[Figure: chi-square distribution]

The haunting of the p-value

As mentioned, the last part of statistical significance testing in the frequentist tradition is the p-value. This value is the probability of seeing a pattern that far or more from 0 if in reality, a pattern did not exist in the population. How far is too far, though, to conclude that the real pattern is something to say anything about? The answer depends a bit on the field, but the common answer is that “too far” can be identified by a probability below 0.05.

Beyond this being a somewhat arbitrary threshold, the usage of the p-value as a decision point has been widely abused throughout history, leading many scholars and practitioners to draw superficial conclusions by just keeping and reporting results that meet this threshold and discarding others. This misuse has led to a phenomenon that some in the field refer to as “star hunting.” This refers to the common convention of denoting statistical significance in publications using the asterisk symbol, and thus the tendency of some researchers to have a bias towards doing analyses that produce “stars” in their publications. (Indeed, there is some evidence that some of this bias is evident in the publication process.)
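To see why keeping only the results that clear the threshold is dangerous, consider this small simulation. It is a sketch under made-up assumptions: 1,000 hypothetical studies, each running a one-sample z-test on 50 observations drawn from a population where the null hypothesis is true by construction. Any “star” it finds is a fluke.

```python
import math
import random
from statistics import NormalDist

random.seed(1)  # fixed seed so the sketch is reproducible

N_STUDIES = 1000    # hypothetical number of studies
N_PER_STUDY = 50    # hypothetical sample size per study
ALPHA = 0.05

def null_study_p_value():
    """One-sample z-test on data drawn from N(0, 1), so the null is true."""
    sample = [random.gauss(0.0, 1.0) for _ in range(N_PER_STUDY)]
    z = (sum(sample) / N_PER_STUDY) / (1.0 / math.sqrt(N_PER_STUDY))
    return 2 * (1 - NormalDist().cdf(abs(z)))

p_values = [null_study_p_value() for _ in range(N_STUDIES)]
n_stars = sum(p < ALPHA for p in p_values)

print(f"{n_stars} of {N_STUDIES} null studies came out 'significant'")
```

Roughly 5% of these studies should come out “significant” even though there is nothing to find, so reporting only the starred results would present flukes as findings.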

What should we do with the p-value? In 2016, in response to this problem, the American Statistical Association released The ASA’s Statement on p-Values: Context, Process, and Purpose. This is (in my opinion) a very accessible document that lays out many of the dangers of relying too heavily on p-values. The statement provides six principles about the use of the p-value that I think are very important to keep in mind. I have copied them, verbatim, here:

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency.
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

In short: You should go and read that article.

I hope I have provided a useful overview of statistical significance. Let me know if there is anything you think is missing.