
Statistical Power of a Study


1. Core Knowledge

There are two main reasons why a study may not show a significant difference between groups being studied (e.g. in a randomized trial of a new drug, or a case-control study testing the effect of an exposure on a disease).

  1. There really was no difference (hence a true negative result).
  2. There was a difference but the study failed to detect it (false negative result). This may arise because the study was poorly designed (e.g. used imprecise measurements) or because the study was too small (in statistical jargon, it "lacked power").

The power of a study is the probability that it will detect a difference, if that difference really exists.

2. Nice to Know

Besides the size of the sample, statistical power is affected by three factors:

  1. The difference in outcome rates between the two groups. A smaller difference is harder to detect: halving the difference roughly quadruples the sample size needed to keep the same power.
  2. The significance level you are hoping to reach (e.g. p < 0.05 or p < 0.001). Demonstrating a smaller p value demands more power, and hence a larger study.
  3. The frequency of the outcome in the two groups. Imagine an exposure that increases incidence by a half: it is easier to show a difference between 30 and 45 per cent than between 10 and 15 per cent. For a given odds ratio, power is greatest when roughly half of the people studied have the outcome of interest.
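
To make the three factors concrete, here is a minimal sketch that approximates the power of a two-sided z-test comparing two proportions. The function name `power_two_proportions`, the normal approximation, and the choice of 100 people per group are illustrative assumptions, not something taken from a particular study or textbook.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test for two proportions.

    Assumes equal group sizes and a normal approximation (a sketch only).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)           # critical value for the chosen p threshold
    p_bar = (p1 + p2) / 2                       # pooled rate assumed under "no difference"
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group)
    # Probability that the observed difference exceeds the critical threshold
    return z.cdf((abs(p2 - p1) - z_crit * se_null) / se_alt)


# Factor 3: same relative increase, different outcome frequency
print(power_two_proportions(0.30, 0.45, n_per_group=100))
print(power_two_proportions(0.10, 0.15, n_per_group=100))
# Factor 2: a stricter significance level lowers power for the same study
print(power_two_proportions(0.30, 0.45, n_per_group=100, alpha=0.001))
```

With 100 people per group, the 30 versus 45 per cent comparison has power of roughly 0.6 under this approximation, while the 10 versus 15 per cent comparison has power of roughly 0.2.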

One way to think about power is in terms of false negative results: power is the likelihood of avoiding a false negative.  [A bit like your motorcycle: the more power it has, the less likely it is to get stuck in the mud...]  Real statisticians use more formal terms, of course, and speak of power as the probability of not making a beta, or "Type II", error (power = 1 - beta). A Type II error means falsely concluding that there was no difference (e.g., between experimental and control groups) when in fact there was a difference, but the study failed to show it.
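
The link between power and false negatives can be illustrated by simulation. The sketch below is only an illustration under assumed values: it repeatedly simulates a study in which the true outcome rates are 30 and 45 per cent, with 100 people per group, and counts how often a simple pooled z-test reaches p < 0.05. The function name `simulate_power`, the number of simulated studies, and the seed are all hypothetical choices.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_power(p_control, p_exposed, n_per_group, alpha=0.05,
                   n_studies=2000, seed=1):
    """Estimate power as the share of simulated studies that reach significance."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    significant = 0
    for _ in range(n_studies):
        # Draw the number of people with the outcome in each group of one study
        x1 = sum(rng.random() < p_control for _ in range(n_per_group))
        x2 = sum(rng.random() < p_exposed for _ in range(n_per_group))
        p1, p2 = x1 / n_per_group, x2 / n_per_group
        pooled = (x1 + x2) / (2 * n_per_group)
        se = sqrt(2 * pooled * (1 - pooled) / n_per_group)
        if se > 0 and abs(p2 - p1) / se > z_crit:
            significant += 1
    return significant / n_studies


power = simulate_power(0.30, 0.45, n_per_group=100)
print(f"Estimated power: {power:.2f}")
print(f"Estimated false negative (Type II) rate: {1 - power:.2f}")
```

Every simulated study here has a real difference built in, so each non-significant result is a false negative; the proportion of such results is an estimate of beta.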

Any study involves only a sample of people from the population of interest, and there are several reasons why the study may fail to detect the real difference that exists in the population.  The factors listed above, together with the size of the sample, determine whether or not a study will be able to detect a real difference.

3. Additional Information

When you design a study, you set the power level you require, just as you set the level of significance that you will accept as being “significant”. A power of 80% is often chosen; hence a true difference of the expected size will be missed 20% of the time. This is a compromise: raising power to 90% requires increasing the sample size by about a third, and raising it to 95% entails an increase of about two thirds, substantially increasing the cost of the study.
The size of sample required for a study is generally calculated (based on estimates of (1) and (3) above) before a study begins, to indicate its power to detect a true difference. If a study subsequently finds a null result, power is normally re-checked using the actual results from the study to show how likely it was to have been a false negative result. There are various formulae for calculating power; this is where a physician would normally consult with a statistician.
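
The sample-size cost of aiming for higher power can be checked with a short calculation. This sketch assumes a two-sided test at p < 0.05 and the usual normal-approximation formula, in which the required sample size is proportional to (z_alpha + z_beta) squared; the function name `relative_sample_size` is an illustrative choice.

```python
from statistics import NormalDist


def relative_sample_size(power, baseline_power=0.80, alpha=0.05):
    """Sample size needed for `power`, relative to that needed for `baseline_power`.

    Assumes n is proportional to (z_alpha + z_beta)^2 (normal approximation).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)

    def factor(pw):
        return (z_alpha + z.inv_cdf(pw)) ** 2

    return factor(power) / factor(baseline_power)


for target in (0.90, 0.95):
    increase = (relative_sample_size(target) - 1) * 100
    print(f"Power {target:.0%}: about {increase:.0f}% more participants than at 80%")
```

Under these assumptions the ratio does not depend on the size of the difference or on the outcome rates, which is why the same rule of thumb applies across studies.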

The actual formula you will use to calculate power depends on the statistical test you will select to analyze the data; a statistics book will give details.  All you need to understand is the general concept since you will sometimes hear comments such as "but the study had low power".

The final (really cool) step is that you can also figure out how large a sample you will need for a study, if it is to detect a real difference of a given size.  This requires an estimate of the true difference (e.g., between experimental and control groups) that you are trying to detect, the associated standard deviation (SD), and the level of power you wish to achieve (perhaps 85 or 90%).  Because you have not yet done the study, these estimates are usually taken from previous studies, or set by specifying the smallest difference you wish to be able to detect (e.g., arguing that anything smaller would not be clinically important).
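
As a rough sketch of such a sample size calculation, the code below uses the common normal-approximation formula for comparing two means with equal group sizes: n per group = 2 x ((z_alpha + z_beta) x SD / difference) squared. The function name `sample_size_two_means` and the example values (a difference of 5 units with an SD of 10) are hypothetical; a real study would use the formula that matches its planned statistical test.

```python
from math import ceil
from statistics import NormalDist


def sample_size_two_means(difference, sd, power=0.90, alpha=0.05):
    """Approximate number of people needed per group to detect `difference`.

    Assumes a two-sided comparison of two means, equal group sizes, and a
    common standard deviation `sd` (normal approximation, a sketch only).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # significance level
    z_beta = z.inv_cdf(power)            # desired power
    return ceil(2 * ((z_alpha + z_beta) * sd / difference) ** 2)


# Hypothetical planning values: smallest clinically important difference = 5, SD = 10
print(sample_size_two_means(difference=5, sd=10, power=0.90))
# Halving the difference to be detected roughly quadruples the sample
print(sample_size_two_means(difference=2.5, sd=10, power=0.90))
```

The second call shows why the choice of the smallest difference worth detecting matters so much: it drives the required sample size more strongly than the choice of power level.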
