Hypothesis Testing: Type I & Type II Errors

CFA Level I — Quantitative Methods

Every statistical test is a bet — and there are exactly two ways to lose it.

A courtroom for data

Imagine you're on a jury. The defendant (the null hypothesis, H0) is on trial. Your job is not to prove innocence — it's to decide whether the evidence is strong enough to convict.

In statistics, the null hypothesis is the default claim — the status quo. For example: "the mean daily return of this portfolio is zero." The alternative hypothesis (Ha) is the challenger: "the mean daily return is not zero."

You never "accept" the null. Just like a court verdict of "not guilty" isn't the same as "innocent," failing to reject H0 doesn't mean it's true. It means the evidence wasn't strong enough to overturn it.

Analogy

The null hypothesis is the defendant. The sample data is the evidence. The significance level is the burden of proof. And your decision — reject or fail to reject — is the verdict.

Where this breaks down: unlike a real courtroom, the threshold for "conviction" (α) is chosen by the researcher before the trial begins, and there's no jury deliberation — the test statistic either crosses the critical value or it doesn't.

Two ways to get it wrong

Every hypothesis test can produce one of four outcomes. Two are correct decisions, and two are errors. The errors have names:

Type I Error — Rejecting H0 when it's actually true. You convicted an innocent defendant. In our courtroom analogy, this is a wrongful conviction. The probability of this happening is called α (alpha), the significance level.

Type II Error — Failing to reject H0 when it's actually false. The guilty defendant walked free. The probability of this is β (beta). The complement, 1 − β, is called the power of the test — the probability of correctly catching a false null.

                       H0 is True            H0 is False
Fail to Reject H0      Correct Decision      Type II Error (β)
Reject H0              Type I Error (α)      Correct Decision (Power = 1 − β)

Notice the asymmetry: the researcher chooses α directly (by setting the significance level), but β depends on multiple factors including the sample size, the true value of the parameter, and α itself.

The unavoidable tradeoff

Here's what makes this tricky: for a fixed sample size, reducing α increases β. If you demand more evidence to convict (lower α, say from 5% to 1%), you make it harder to catch a truly guilty defendant (higher β, lower power).

Think about it this way. If a 5% significance level sets your critical z-value at ±1.96, then moving to 1% pushes it to ±2.58. Now the test statistic needs to be more extreme to reject H0. That means fewer false convictions — but also fewer true ones.

The only way to reduce both errors simultaneously is to increase the sample size. More data gives you a tighter estimate, which makes it easier to distinguish a true effect from noise without loosening your conviction standard.
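The tradeoff can be sketched numerically. The function below computes the power of a two-tailed z-test analytically, using only the Python standard library. The setup is hypothetical (H0: μ = 0, known σ = 1, true mean 0.4), chosen to echo the tradeoff described above rather than reproduce any particular example exactly:

```python
# Power of a two-tailed z-test, computed analytically.
# Hypothetical setup: H0: mu = 0, known sigma, true mean = `effect`.
from statistics import NormalDist

Z = NormalDist()

def power(alpha: float, n: int, effect: float, sigma: float = 1.0) -> float:
    """P(reject H0 | H0 is false) for a two-tailed z-test of H0: mu = 0."""
    crit = Z.inv_cdf(1 - alpha / 2)               # critical z-value for this alpha
    delta = effect / (sigma / n ** 0.5)           # true mean in standard-error units
    beta = Z.cdf(crit - delta) - Z.cdf(-crit - delta)  # P(Type II error)
    return 1 - beta

# Lowering alpha (5% -> 1%) at fixed n reduces power...
print(f"{power(0.05, 30, 0.4):.3f}")   # about 0.59
print(f"{power(0.01, 30, 0.4):.3f}")   # noticeably lower
# ...while raising n restores it, even at the stricter alpha.
print(f"{power(0.01, 120, 0.4):.3f}")  # higher than both
```

A sanity check on the formula: when the true effect is zero (H0 actually true), `power(alpha, n, 0.0)` returns exactly α — rejecting a true null at rate α is the Type I error by definition.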

Before you see it — commit

Your Prediction
A researcher changes their significance level from 5% to 1%, keeping everything else the same. What happens to the power of the test?

If you said power decreases: correct. Lowering α from 5% to 1% moves the critical values further into the tails. This makes it harder to reject H0 even when it's false, which means β increases and power (1 − β) decreases. You're protecting against false positives at the cost of missing real effects.

If you said power increases or stays the same: not quite. Lowering α makes the test more conservative — you need more extreme evidence to reject H0. That means you're less likely to reject it even when it's actually false (higher β). Since power = 1 − β, power goes down. The only way to recover power while keeping α low is to increase sample size.

Watch the tradeoff unfold

Use the sliders to change the significance level and sample size. Watch how the rejection regions (shaded tails) shift, and how the error probabilities respond. The blue curve is the distribution under H0 (null is true). The orange curve shows the distribution if the true mean is actually different from the hypothesized value.

Interactive Simulation
[Sliders for significance level and sample size. Example readout at α = 5%, n = 30, effect size = 0.4: P(Type I) = α = 5.0%; P(Type II) = β = 42.3%; Power = 1 − β = 57.7%.]

What you just saw: lowering α shrinks the rejection regions, making Type I errors less likely — but the alternative distribution's overlap with the non-rejection zone grows, meaning Type II errors rise and power drops. The only lever that helps both simultaneously is sample size: more data sharpens both distributions, reducing their overlap.

The 7-step hypothesis testing procedure

Every hypothesis test follows the same structured workflow. Memorize the sequence — it's tested directly on the exam.

1. State the hypotheses. Define H0 and Ha. The null always includes the "=" sign.
2. Select the test statistic. Choose based on the parameter, distribution, and whether σ is known (z-test vs. t-test).
3. Specify the significance level. Set α (commonly 0.05 or 0.01). This determines the critical values.
4. State the decision rule. Define the rejection region: e.g., reject H0 if |z| > 1.96 for a two-tailed test at α = 0.05.
5. Collect the sample and compute the statistic. Calculate the test statistic from sample data.
6. Make the statistical decision. Compare the test statistic to the critical value(s). Reject or fail to reject H0.
7. Make the economic or investment decision. Interpret what the statistical conclusion means in the real-world context.
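The sequence can be traced in a few lines of code. This is a sketch with made-up numbers — testing whether a portfolio's mean daily return is zero, with σ treated as known — not an official worked example:

```python
from statistics import NormalDist

# 1. State the hypotheses: H0: mu = 0 vs. Ha: mu != 0 (two-tailed).
mu0 = 0.0
# 2. Select the test statistic: z-test, since sigma is treated as known.
sigma, n = 0.012, 250            # hypothetical daily volatility and sample size
# 3. Specify the significance level.
alpha = 0.05
# 4. State the decision rule: reject H0 if |z| > critical value.
crit = NormalDist().inv_cdf(1 - alpha / 2)       # about 1.96
# 5. Collect the sample and compute the statistic.
xbar = 0.0019                    # hypothetical sample mean daily return
z = (xbar - mu0) / (sigma / n ** 0.5)
# 6. Make the statistical decision.
reject = abs(z) > crit
# 7. Make the economic decision, e.g. whether the return is worth acting on
#    after costs -- statistical significance alone doesn't settle that.
print(f"z = {z:.2f}, critical = {crit:.2f}, reject H0: {reject}")
```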

Formalizing what you already understand

The test statistic measures how far the sample result is from the hypothesized value, scaled by the precision of your estimate. This is exactly what you saw in the simulation — the test statistic is the gap between the sample mean and H0's value, measured in standard error units.

test statistic = (x̄ − μ₀) / (σ / √n)
When σ is known (z-test). Replace σ with s for the t-test.

Each piece maps to a driver you already know. x̄ − μ₀ is the observed departure from the null. σ / √n is the standard error — the noise level. When n goes up, the denominator shrinks, pushing the test statistic further into the tails. That's why larger samples give you more power: the same effect produces a more extreme test statistic.
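A tiny illustration of the denominator's role, with hypothetical numbers (a fixed observed gap x̄ − μ₀ = 0.2 and σ = 1; only n changes):

```python
# Fixed observed gap, shrinking standard error: quadrupling n doubles z.
gap, sigma = 0.2, 1.0
for n in (25, 100, 400):
    z = gap / (sigma / n ** 0.5)    # test statistic in standard-error units
    print(f"n={n:3d}  z={z:.1f}")   # z = 1.0, 2.0, 4.0
```

The same 0.2 gap is ignorable noise at n = 25 but decisive evidence at n = 400.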

Key critical values to memorize

For z-tests, the commonly tested critical values are: ±1.65 for a two-tailed test at 10% (or 1.65 for a one-tailed test at 5%), ±1.96 for a two-tailed test at 5%, 2.33 for a one-tailed test at 1%, and ±2.58 for a two-tailed test at 1%.
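You can verify each of these with the standard library's inverse normal CDF (note the 10%-two-tailed value is 1.645 to three decimals; tables often round it to 1.65):

```python
from statistics import NormalDist

inv = NormalDist().inv_cdf
# Two-tailed: alpha/2 in each tail. One-tailed: all of alpha in one tail.
print(f"{inv(0.95):.3f}")   # 1.645 -> two-tailed 10% or one-tailed 5%
print(f"{inv(0.975):.3f}")  # 1.960 -> two-tailed 5%
print(f"{inv(0.99):.3f}")   # 2.326 -> one-tailed 1%
print(f"{inv(0.995):.3f}")  # 2.576 -> two-tailed 1%
```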

The p-value

The p-value is the smallest significance level at which you would reject H0. If the p-value is less than α, you reject. It gives you a continuous measure of evidence strength — rather than just a binary reject/fail-to-reject verdict.

In plain English: the test statistic tells you how many standard errors your sample is from what the null hypothesis predicts. If it's far enough out in the tails (beyond the critical value), you reject. If not, the evidence isn't strong enough.
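Computing a two-tailed p-value takes one line — double the area beyond the observed statistic. Here 2.14 is just a hypothetical test statistic used as a worked value:

```python
from statistics import NormalDist

z = 2.14                                   # hypothetical two-tailed test statistic
p = 2 * (1 - NormalDist().cdf(abs(z)))     # double the one-tail area
print(f"p = {p:.3f}")                      # 0.032 < 0.05 -> reject at the 5% level
```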

The One-Liner

Type I is a false alarm (rejecting a true null, probability α); Type II is a miss (failing to reject a false null, probability β). For a fixed sample size, protecting against one error increases the other — only more data helps both.

Apply what you learned

Question 1
A pharmaceutical company tests whether a new drug lowers cholesterol more than a placebo. They set α = 0.05 and fail to reject H0. Later, a larger study shows the drug actually does work. What error did the original study make?
Type I would mean they rejected H0 when it was true. But they failed to reject, and H0 turned out to be false — that's a Type II error.
Correct. The drug actually works (H0 is false), but the original study didn't have enough evidence to reject H0. That's the definition of a Type II error — likely caused by insufficient sample size.
Following the procedure correctly doesn't prevent errors. Their sample may have been too small to detect the real effect, leading to a Type II error.
Question 2
An analyst runs a two-tailed z-test at α = 0.05 and gets a test statistic of 2.14. The p-value for this test is closest to:
The p-value is not fixed at α. It's determined by the test statistic. Since 2.14 > 1.96 (the critical value at α = 0.05), the p-value must be less than 0.05.
Correct. For a two-tailed test, the p-value is 2 × P(Z > 2.14) ≈ 2 × 0.0162 ≈ 0.032. Since 0.032 < 0.05, we reject H0. The p-value is the smallest α at which you'd still reject.
0.016 is the one-tail probability P(Z > 2.14). For a two-tailed test, you must double it: 2 × 0.016 ≈ 0.032.
Question 3
An investment manager wants to test if a fund's alpha is significantly different from zero. She currently uses α = 0.10 with n = 50 and wants to increase the power of the test. Which combination of changes would achieve this?
Decreasing α moves the critical values further into the tails, making it harder to reject H0. With the same sample size, power decreases.
Increasing α would help power, but decreasing n hurts it — less data means a wider standard error and less ability to detect real effects. These forces likely cancel out or make things worse.
Correct. Keeping α the same and increasing n is the cleanest way to boost power. A larger sample reduces the standard error, which makes the test statistic more sensitive to real departures from H0, without changing the false positive rate.
Tricky. Increasing n helps power, but decreasing α from 0.10 to 0.01 substantially hurts it. These two effects work against each other. The n increase may not be enough to overcome the dramatic tightening of α.

I'm pro AI. The content above was co-created with my friend, Claude.