Power analysis (English)

From Wikistatistiek

This text is an edited version of the AMC sample size calculation manual [1]. It provides a practical guide to the sample size calculations used in clinical research. After reading the manual, a researcher will know:

  • why power analysis is used to plan and evaluate medical research
  • what power and statistical significance mean
  • what information is needed for a sample size calculation
  • where to find the information needed
  • how to perform a simple sample size calculation
  • how to write down a power calculation.

In addition, the manual contains two practical examples of sample size calculations.

Why perform a sample size calculation?

The main reasons for performing a sample size calculation are ethical. If the number of subjects tested in a study is too small to detect the effect being investigated, the subjects will have been exposed to the risks of participating in the study in vain, because the study will easily result in a false negative conclusion.

On the other hand, testing too many subjects may also lead to undesirable situations. If an intervention turns out to be effective, too many subjects have missed out on this intervention. If the intervention is not effective, too many have been exposed to this ineffective intervention. For these reasons, researchers planning a trial should always consider what number of subjects would be appropriate to answer the study question. A sample size calculation prior to a study helps to focus on the number of subjects that is both needed and sufficient. Moreover, it helps one to focus on a clinically relevant effect, instead of following the erroneous strategy of testing as many subjects as needed to reach statistical significance for an irrelevant effect.

The CONSORT statement (a guideline for reporting clinical trials) states that a researcher should calculate the study size beforehand and should report this calculation in the methods section of the resulting scientific paper. The AMC Medical Ethics Board (MEC) and the Animal Experiments Committee (DEC) also ask for a power calculation in the approval process. The same holds for most study grant applications (e.g., ZonMW).

Finally, the logistic planning of a study benefits from a sample size calculation.

Power and statistical significance

The term ‘power’ pops up everywhere in medical research, certainly in sample size calculations. Often, the term is interpreted as a synonym for the number of patients tested in a study. ‘Our study did not have enough power to control for possible confounders’ is understood as ‘you didn’t test enough patients to account for several effects’. ‘Our study had 80% power to detect an OR of 1.1 at a significance level of 5%’ is understood as ‘you have tested enough patients to pick up a possible effect’. Although these interpretations are not entirely wrong, we need to understand the exact meaning of power in order to use the concept in a sample size calculation.

Formally: the power of a study testing the null hypothesis H0 against the alternative hypothesis H1 is the probability that the test (based on a sample from the population) rejects H0, given that H0 is false (in the whole population). So the power is the probability of correctly rejecting a null hypothesis, i.e. rejecting it when it should be rejected. Since in most tests H0 is stated as ‘no difference between groups’ or ‘no effect of intervention’, for example H0 = ‘no difference in survival between treated and control group’, rejecting H0 means you have reason to believe there is a difference. In other words, the power reflects the ability to pick up an effect that is present in a population using a test based on a sample from that population (true positive). The power of a study is closely related to the so-called type II error, whose probability β is the probability of failing to reject H0 when H0 is false. The power of a study is 1 − β, so it is the probability of rightfully rejecting H0 (see Table 1). The table also shows the significance level α. Alpha is the probability of falsely rejecting H0, i.e., falsely picking up an effect (false positive). Note that α only concerns situations in which no true effect exists in the population.
In a sample size calculation one determines the number of patients needed to test the hypothesis with large enough power and small enough significance level. In this way one protects oneself against false negative and false positive conclusions.

Table 1: Possible conclusions and errors of a study in relation to the truth.

                                        Whole population
Study conclusion                        effect exists (H1 is true)    no effect exists (H0 is true)
--------------------------------------------------------------------------------------------------
effect observed (H1 appears true)       true positive                 false positive
                                        power (1 − β)                 type I error (α)
no effect observed (H0 appears true)    false negative                true negative
                                        type II error (β)             (1 − α)
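The relation between sample size, significance level and power can be made concrete with a small calculation. The sketch below is an illustration only, not part of the AMC manual: it assumes a two-sided z-test comparing the means of two equally sized groups with a known common standard deviation (a normal approximation), and the function name `power_two_means` and its parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_means(n_per_group, delta, sd, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided z-test on two group means.

    Assumptions (not from the manual): equal group sizes, a known common
    standard deviation `sd`, and a true difference in means of `delta`.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)      # critical value for significance level alpha
    se = sd * sqrt(2 / n_per_group)        # standard error of the difference in means
    shift = abs(delta) / se                # true effect expressed in standard errors
    # Probability that the test statistic falls in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


if __name__ == "__main__":
    # More subjects per group give more power for the same true difference.
    for n in (20, 40, 63, 100):
        print(n, round(power_two_means(n, delta=5, sd=10), 3))
```

When the true difference is zero, the function returns α: the only way to reject H0 is then a false positive. With a true difference of half a standard deviation, roughly 63 subjects per group give a power of about 80% under these assumptions.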

Information required to calculate a sample size

To make a sample size calculation based on the power of a study, one needs information about each of the following values:

  • Desired power of the study (1 − β): How much power do you want in the study? Or, stated differently, how certain do you want to be of preventing a type II error?
  • Desired significance level (α): How certain do you want to be of preventing a type I error?
  • Desired test direction: One-sided or two-sided test?
  • Clinically relevant (or expected) difference: Which difference or which effect are you trying to find?
  • Expected variance / standard deviation: How much variation is expected in subjects belonging to the same study group?
  • Test to be used in statistical analysis: How will the hypothesis test be performed in the analysis phase of the study?
  • Attrition rate: Anticipate the number of included subjects who will not be available for the study analysis.
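To see how these inputs combine, consider the simplest textbook case, which is given here as an illustration and is not taken from the AMC manual: a two-sided comparison of the means of two equally sized groups, using a normal approximation. The symbols are assumptions of this sketch: n is the number of subjects per group, σ the common standard deviation, Δ the difference to be detected, and z_p the p-quantile of the standard normal distribution.

```latex
n \;=\; \frac{2\,\bigl(z_{1-\alpha/2} + z_{1-\beta}\bigr)^{2}\,\sigma^{2}}{\Delta^{2}}
```

For example, with α = 0.05 (z ≈ 1.96), power 0.80 (z ≈ 0.84), σ = 10 and Δ = 5, this gives n ≈ 2 × (2.80)² × 100 / 25 ≈ 62.7, which is rounded up to 63 subjects per group. For a one-sided test, α/2 is replaced by α. Other outcome types and tests require other formulas.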


Sources of information for sample size calculations

In this section we advise on how to determine or choose the necessary input values for a sample size calculation.

Desired power of the study
80% is a common power level used in sample size calculations. It means that you accept a chance of 20% (one in five) of failing to detect an effect in your study sample that is indeed present in the population (false negative). If you want to reduce the chance of missing an existing effect, you should increase the power level, for instance to 90%. Increasing the power level will increase the sample size.
Desired significance level
5% is a common significance level used in hypothesis testing. This means that, if no effect exists in the whole population, you accept a probability of 0.05 of nevertheless detecting one in your study (false positive). A reason to lower the significance level might be that multiple tests are performed, which increases the chance of at least one false positive finding. Lowering the significance level will increase the sample size.
Desired test direction
A two-sided test is standard. It means that you test the possibility that treatment A is better than treatment B and the other option (treatment B better than treatment A) simultaneously. A one-sided test can only be considered when a clear rationale is provided for testing only one direction of the alternative hypothesis (ethical committees and journals are quite strict on this point; some even reject all one-sided tests). See e.g. Knottnerus (2001) or Peace (1989) for considerations about using a one-sided test for sample size calculation.
Clinically relevant (or expected) difference
Here you have to define the difference that you would like to detect with your study. It can be the effect that has been found in previous studies and that you would like to reproduce. In situations where there are no previous studies, you can define a difference that you consider clinically relevant. Information can be found in previous studies in the literature or can be based on expectations from clinical practice. Since a small effect is more difficult to pick up than a large effect, decreasing the difference (or effect) will increase the sample size. Note: frequently, available time and resources do not allow the conduct of a clinical trial large enough to reliably detect the smallest clinically relevant effect. In these cases, one may choose a larger difference, with the realization that should the trial result be negative, it will not reliably exclude the possibility of a smaller but clinically-important treatment difference (Lewis, 2000). In general, you have to find a balance between defining a large(r) effect that is easier to pick up (i.e. requiring fewer subjects) and running the risk of obtaining a non-significant result if the difference turns out to be smaller.
Expected variance / standard deviation
This should be based on pilot data, previous projects in your institute, or comparable studies found in the literature. When no direct estimate of the standard deviation is available, nQuery (next section) offers some help tables. If high variation exists between subjects, a difference between groups or an effect of intervention will be harder to pick up, so more spread in the data will increase the sample size.
Test to be used in the statistical analysis
A power calculation is always based on one particular statistical analysis. Therefore, the sample size calculation forces you to think about the planned data analysis in a very early phase of the study. Help on the correct choice of analysis can for instance be found on the ‘wiki biostatistiek’ (http://biostatistiek/mediawiki, available from the AMC network).
Attrition rate
Previous studies in the same population will give an estimate of the expected number of included subjects who will not be available for analysis. This may be caused by dropout or withdrawal from the study. Study burden, length of follow-up and, for instance, age will influence the attrition rate. The simplest form of attrition, i.e. attrition not related to the intervention or the outcome, can easily be corrected for in the sample size calculation. After calculating the sample size, adjust it so that the number needed remains after the expected loss of study subjects. For example: if an attrition rate of 10% is expected, divide the number needed by 0.9 (1 − attrition rate).

Since one never knows the exact difference, variation or attrition rate of a study beforehand, a sample size calculation remains a difficult exercise. Repeat the calculation using slightly different input values and check the consequences of these modifications. If absolutely no information is available for the estimation of the necessary input values, one may consider doing a pilot study first.
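The steps described in this section (choose the inputs, calculate, correct for attrition, then repeat with slightly different inputs) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used by nQuery or any other program: it assumes a comparison of two group means with equal group sizes and a normal approximation, and the function names `n_per_group` and `inflate_for_attrition` are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80, two_sided=True):
    """Subjects needed per group to detect a difference in means `delta`.

    Assumption (not from the manual): normal approximation,
    n = 2 * ((z_alpha + z_beta) * sd / delta) ** 2, rounded up.
    """
    z = NormalDist()
    tail = alpha / 2 if two_sided else alpha
    z_alpha = z.inv_cdf(1 - tail)   # quantile for the significance level
    z_beta = z.inv_cdf(power)       # quantile for the desired power
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


def inflate_for_attrition(n, attrition_rate):
    """Divide the number needed by (1 - attrition rate) and round up."""
    return ceil(n / (1 - attrition_rate))


if __name__ == "__main__":
    # Sensitivity check: vary the assumed difference and standard deviation
    # and see how the number to include (with 10% attrition) changes.
    for delta in (4, 5, 6):
        for sd in (9, 10, 11):
            n = n_per_group(delta, sd)
            print(f"delta={delta} sd={sd}: need {n} per group, "
                  f"include {inflate_for_attrition(n, 0.10)}")
```

Running the loop shows how strongly the result depends on the inputs: a slightly smaller difference or a slightly larger standard deviation can raise the required number of subjects considerably, which is why repeating the calculation with different values is worthwhile.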


Software

Several high-quality computer programs exist for performing sample size calculations. The AMC has a license for the program nQuery Advisor and, for more advanced study designs, the Helpdesk Statistics has access to the program NCSS PASS 15. Several free power programs can also be found on the internet, but their reliability is not always guaranteed.

Advanced topics

Some study designs require other types of sample size calculations. Researchers planning these types of studies should contact a statistician to discuss the calculations required.

Equivalence design
In an equivalence design you do not want to test for differences, but to show equivalence. Since perfect equivalence can never be demonstrated, you need to specify what you consider to be similar. A limit has to be set below which a difference between groups is considered not meaningful and leads to the conclusion of equivalence; this is called the equivalence limit difference. The expected difference between groups also has to be given. A special type of equivalence design is the non-inferiority design, in which one is interested in equivalence in only one test direction. For instance, when a new, less invasive diagnostic procedure is compared to the current invasive one, the new procedure does not have to prove better than the current one. If its diagnostic strength is at least similar to that of the invasive procedure, the new procedure would be preferred.
Clustered design
In some studies, patients are randomized in clusters instead of individually. Examples of clusters are physicians' practices or hospital wards. In this type of study, the outcomes of patients within a cluster are not statistically independent of each other, and the correlation between patients needs to be included in the sample size calculation. Similarly, if multiple observations per patient are obtained, the power calculation has to account for the correlation between measurements in the same patient.
Advanced analyses
Planned statistical analyses such as survival analysis, regression analysis, and reliability analysis call for their own specific sample size calculations.
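For the clustered design described above, a commonly used first approximation is to multiply the sample size from an individually randomized design by a design effect of 1 + (m − 1) × ICC. This formula is not given in the AMC manual; it is a standard approximation that assumes equal cluster sizes m and a known intracluster correlation coefficient (ICC). The function names below are hypothetical, and the sketch is no substitute for consulting a statistician.

```python
from math import ceil


def design_effect(cluster_size, icc):
    """Inflation factor for cluster randomization: 1 + (m - 1) * ICC.

    Assumptions (not from the manual): all clusters have the same size
    `cluster_size`, and `icc` is the intracluster correlation coefficient.
    """
    return 1 + (cluster_size - 1) * icc


def clustered_sample_size(n_individual, cluster_size, icc):
    """Return (subjects per arm, clusters per arm) for a clustered design.

    `n_individual` is the per-arm sample size that an individually
    randomized study would need.
    """
    n_total = ceil(n_individual * design_effect(cluster_size, icc))
    n_clusters = ceil(n_total / cluster_size)
    return n_total, n_clusters


if __name__ == "__main__":
    # Example: 63 subjects per arm if randomized individually,
    # clusters of 20 patients, and an assumed ICC of 0.05.
    print(clustered_sample_size(63, cluster_size=20, icc=0.05))
```

Even a small correlation within clusters can nearly double the required number of subjects when clusters are large, which is why the correlation cannot be ignored in the calculation.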

References

  1. van Geloven N, Dijkgraaf M, Tanck M, Reitsma J. AMC biostatistics manual - Sample size calculation. 2009. Amsterdam: Academic Medical Center.
