Misc 04 Sample size and power


 

[Image: a sumo wrestler, with the caption "If power analysis were an athlete, it would be a sumo wrestler"]

 

Defending your sample size

 

Overview

Many scientific studies lack sufficient statistical power, failing to defend their sample size.

 

Contents

4.1 Data simulation to explore sample size

4.2 Power and sample size calculation

4.3 Suggested reading

 

This page provides a brief overview of two methods to defend your sample size. First, we examine data simulation as a way to explore sample size; then we look at tools for calculating power and sample size directly.

 

 

Data simulation to explore sample size

Here, we explore how to simulate data by random sampling from a population with a known (or estimated) mean and variance, or a known proportion. Such data sets can be useful for exploring design and analysis strategies prior to beginning an experiment.

Two vector functions you will use frequently are c() to concatenate values and rep() to replicate values. For example,

x1 <- c(1, 2, 3, 4, 5)             # Combine numbers in a vector
x2 <- c(x1, c(9, 8, 7))            # Combine two vectors into one
x <- rep(1, 10)                    # Make a vector with ten 1's
x <- rep(c(1, 2), 5)               # Make the vector 1 2 1 2 ... (5 times)
A <- rep(c("a", "b"), c(4, 2))     # Make the vector a a a a b b

 

Gaussian-distributed populations

In the following example we sample 5 random numbers from a Gaussian population having a mean of 2 and a standard deviation of 10. Repeat it several times to convince yourself that the outcomes are different each time. Modify the mean and sd values to see how this affects the results. Change the 5 to another number (e.g., a higher number like 30) to obtain a different sample size.

myrand <- rnorm(n = 5, mean = 2, sd = 10)

# You can round the data to 2 decimal places using:

myrand <- round(x = myrand, digits = 2)
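
To see why sample size matters, you can repeat the sampling many times and examine how much the sample mean varies from sample to sample. A minimal sketch (the object names and the 1000 replicates are our choices):

# Keep the mean of each of 1000 random samples, for n = 5 and n = 30
means5  <- replicate(1000, mean(rnorm(n = 5,  mean = 2, sd = 10)))
means30 <- replicate(1000, mean(rnorm(n = 30, mean = 2, sd = 10)))

# The spread of the sample means shrinks as n grows
sd(means5)   # should be near 10/sqrt(5), about 4.5
sd(means30)  # should be near 10/sqrt(30), about 1.8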

 

Categorical data

To sample 20 individuals from a population in which 40% of individuals are diseased and 60% are healthy,

sample( c("diseased", "healthy"), size = 20, 
        replace = TRUE, prob = c(.4, .6))
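
Since each run gives a different random sample, it can be handy to tabulate the outcome. A small sketch (the variable name mysample is arbitrary):

mysample <- sample(c("diseased", "healthy"), size = 20, 
                   replace = TRUE, prob = c(.4, .6))
table(mysample)   # counts of diseased and healthy in this sample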

 

Two treatments, Gaussian data

The following commands generate a data frame with data from 20 individuals in two treatment groups (10 in each group). In this first example, the mean response is the same between treatments.

treatment <- rep( c("treat", "control"), c(10, 10))

response <- rnorm(20, mean = 10, sd = 3)

mydat <- data.frame(treatment, response, stringsAsFactors = FALSE)

mydat
   treatment  response
1      treat 12.316904
2      treat  4.645895
3      treat  7.052028
4      treat 13.003198
5      treat 12.727637
6      treat 11.788675
7      treat  8.790188
8      treat 10.916575
9      treat  5.706918
10     treat  9.572046
11   control  7.700946
12   control  7.935650
13   control  8.032526
14   control  7.726573
15   control 16.505601
16   control 12.831074
17   control  5.933128
18   control  9.521768
19   control  7.688485
20   control  9.102640
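
Simulated data like these let you rehearse the planned analysis before running the experiment. For example, a two-sample t-test on this simulation, in which the true means are equal, should yield a significant result only about 5% of the time:

# Any significant difference here would be a false positive
t.test(response ~ treatment, data = mydat)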

 

The following commands modify the above procedure so that the mean is different between treatment and control groups, but the standard deviation remains the same (the usual assumption of linear models).

treatment <- rep(c("treat", "control"), c(10, 10))
response1 <- rnorm(10, mean = 10, sd = 3)
response2 <- rnorm(10, mean = 8,  sd = 3)

mydat <- data.frame(treatment, response = c(response1, response2), 
                    stringsAsFactors = FALSE)
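
Repeating such a simulation many times gives a direct estimate of power: the fraction of simulated experiments in which the test rejects the null hypothesis. A minimal sketch under the settings above (the helper function name is ours):

# One simulated experiment: 10 per group, true means 10 and 8, sd = 3;
# return the p-value of the two-sample t-test
sim.pvalue <- function() {
  y1 <- rnorm(10, mean = 10, sd = 3)
  y2 <- rnorm(10, mean = 8,  sd = 3)
  t.test(y1, y2)$p.value
}

pvals <- replicate(1000, sim.pvalue())
mean(pvals < 0.05)   # proportion significant = estimated power

For these settings the estimate comes out well below the conventional 80%, a hint that 10 per group is too small for a difference of this size.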

 

Two treatments, categorical data

The following commands generate a data frame with categorical data from 20 individuals in two treatment groups (10 in each group). The response variable is “dead” or “alive”. In this first example, the proportion alive is the same, 0.3, between treatments.

treatment <- rep(c("treat", "control"), c(10, 10))
survival <- sample(c("alive", "dead"), size = 20, 
                   replace = TRUE, prob = c(.3, .7))

mydat <- data.frame(treatment, survival, stringsAsFactors = FALSE)

table(mydat) # view the sampling results

         survival
treatment alive dead
  control     2    8
  treat       4    6

 

The following commands modify the above procedure so that the probability of survival is different between treatment (0.6) and control (0.3) groups.

treatment <- rep(c("treat", "control"), c(10, 10))

s1 <- sample(c("alive", "dead"), 10, replace = TRUE, prob = c(.6, .4))
s2 <- sample(c("alive", "dead"), 10, replace = TRUE, prob = c(.3, .7))

mydat <- data.frame(treatment, survival = c(s1,s2), stringsAsFactors = FALSE)
table(mydat) # view the sampling results

         survival
treatment alive dead
  control     3    7
  treat       6    4
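
As with the Gaussian examples, the simulated table can be fed directly into the planned analysis, for example Fisher's exact test:

# Test for association between treatment and survival
fisher.test(table(mydat))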

 

 

Power and sample size calculation

The statistical power of a test is its probability of rejecting the null hypothesis when it is in fact false. A rule-of-thumb minimum level of statistical power for experimental design is 80%. Here we introduce tools in R to calculate the sample size required to achieve a specified level of statistical power, given a specified magnitude of difference from the null hypothesis. You can also determine the power for a given sample size.

We’ll use functions in the {pwr} package. These functions are built on the methods of Cohen (1988), Statistical power analysis for the behavioral sciences.

To begin, load the {pwr} package. (You’ll need to install it first with install.packages("pwr") if you haven’t already done so; it isn’t part of the standard installation.)

library(pwr)

 

The procedures involve two steps. The first step converts your quantities of interest to an “effect size”. Effect sizes are standard measures of the magnitude of a difference or of the strength of the relationship between two variables. Effect size is used to design experiments, but it is also used in meta-analysis to compare results from diverse studies using different methods and types of variables.

Cohen developed the measures of effect size that are now in common use.

The second step is to calculate power or sample size for the given or estimated effect size. In all the examples below we use a standard significance level of α = 0.05.

 

For a single proportion

The binomial test is used to compare the observed frequencies of “success” and “failure” with those expected under a null proportion. The χ2 goodness-of-fit test is used in the same context when the sample size is large. The null and alternative hypotheses for the test are

Null hypothesis:
My proportion p IS NOT DIFFERENT from some null proportion p0

H0: p = p0

Alternative hypothesis:
My proportion p IS DIFFERENT from some null proportion p0

HA: p ≠ p0

 

Effect size for proportion test

Each different kind of statistical test you may perform will typically have an associated effect size, and the effect size is different for different kinds of tests: e.g., a difference in means (relative to variation), a measure of association (like the correlation coefficient), etc.

The pwr.p.test() function calculates the statistical power (remember, the probability of rejecting the null hypothesis GIVEN that it is actually false) of this test for a given sample size, or the sample size needed to achieve a specified power. pA refers to the proportion under the alternative hypothesis that you hope to detect. p0 is the null proportion to which you compare pA.

Cohen's h is the effect size here, quantifying the difference between p0 and pA.

# The effect size for comparing proportions is Cohen's "h"
# Use ES.h() to calculate it
# Convert your p's to Cohen's effect size "h"
h <- ES.h(p0, pA)

# Returns SAMPLE SIZE needed to achieve power of 80%
pwr.p.test(h = h, power = 0.8)


# Returns POWER of test when n = 50
pwr.p.test(h = h, n = 50)

 

For example, to determine the sample size needed to achieve 80% power for a test in which the null proportion is 0.5, and in which you hope to detect a proportion at least as large as 0.7, use


myh <- ES.h(0.5, 0.7)

pwr.p.test(h = myh, power = 0.8)

     proportion power calculation for binomial distribution (arcsine transformation) 

              h = 0.4115168
              n = 46.34804
      sig.level = 0.05
          power = 0.8
    alternative = two.sided
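
Cohen's h is twice the difference between the arcsine-square-root transformed proportions, which you can verify by hand (for a two-sided test only its magnitude matters):

2 * asin(sqrt(0.7)) - 2 * asin(sqrt(0.5))   # reproduces h = 0.4115168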

 

For a 2x2 experiment

The Fisher exact test and χ2 contingency test are used to detect association between treatment and response in a 2 x 2 experiment. The {pwr} package has routines to calculate power and sample size in a 2 x 2 design for the χ2 contingency test. To use them, you need to specify the probability of “success” and “failure” in both treatment and control. The method below assumes that the sample size is the same in the treatment and control groups. For example, begin by converting the probabilities to Cohen's effect size "w":


control <- c(0.5, 0.5)                   # control prob. of success, failure
treatment <- c(0.3, 0.7)                 # treatment probs.
probs <- data.frame(treatment, control)  # a 2 x 2 data frame of p's
probs <- cbind(treatment, control)       # this works too: a 2 x 2 matrix of p's

w <- ES.w2(probs/sum(probs))             # Cohen's effect size "w"
w

[1] 0.2041241

 

To calculate the power of the test for a given total sample size,

# N is the total sample size
pwr.chisq.test(w, df = 1, N = 60) 

 

To calculate the total sample size needed to achieve a test with 80% power,

pwr.chisq.test(w, df = 1, power = 0.80)
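
The N reported is the total sample size. With equal allocation you would aim for half in each group; a small sketch using the returned object:

n.total <- pwr.chisq.test(w, df = 1, power = 0.80)$N
ceiling(n.total / 2)   # sample size per group, rounded up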

 

For a two-sample t-test or ANOVA

Cohen's d, the difference between means divided by the pooled standard deviation, quantifies the effect size here.

If your response variable is normally distributed with equal variance in both groups, you would use a two-sample t-test (or, equivalently, an ANOVA on two groups). The built-in stats package in R includes a function to calculate power and sample size for this case, power.t.test().

To calculate power, first specify delta, the known (i.e., expected or estimated) difference between means. You will also need the standard deviation within each group. You can then calculate the sample size needed for a given power, or the power of an experiment with a given sample size n in each group (2n overall, since equal group sizes are assumed):

# The basic function for power and sample size for the t-test

# Calculate SAMPLE SIZE for an assumed
# delta (DIFFERENCE between means) and
# sd (within-group standard deviation, assumed equal in the two groups)
power.t.test(delta = delta, sd = sd, power = 0.80)


# Calculate POWER for a given sample size n per group
power.t.test(n = n, delta = delta, sd = sd)
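
As a worked example with illustrative numbers (a difference between means of 2 units and a within-group standard deviation of 3):

# Sample size per group for 80% power to detect delta = 2 with sd = 3
power.t.test(delta = 2, sd = 3, power = 0.80)

# Power achieved with only n = 10 per group
power.t.test(n = 10, delta = 2, sd = 3)

Note that the n reported by power.t.test() is the sample size in each group, not the total.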

 

 

Suggested reading

 

There is a lot more to this subject than we cover here, and it is important. The readings below are a good place to continue.

 

Cohen, J., 1992. A power primer. Psychol Bull 112, 155–159. https://doi.org/10.1037//0033-2909.112.1.155

Colegrave, N., Ruxton, G.D., 2003. Confidence intervals are a more useful complement to nonsignificant tests than are power calculations. Behav Ecol 14, 446–447. https://doi.org/10.1093/beheco/14.3.446

Open Science Collaboration, 2015. Estimating the reproducibility of psychological science. Science 349, aac4716. https://doi.org/10.1126/science.aac4716

Hoenig, J.M., Heisey, D.M., 2001. The abuse of power: the pervasive fallacy of power calculations for data analysis. The American Statistician 55, 19–24. https://doi.org/10.1198/000313001300339897

Ioannidis, J.P.A., 2005. Why Most Published Research Findings Are False. PLOS Medicine 2, e124. https://doi.org/10.1371/journal.pmed.0020124

Nakagawa, S., 2004. A farewell to Bonferroni: the problems of low statistical power and publication bias. Behav Ecol 15, 1044–1045. https://doi.org/10.1093/beheco/arh107

Verhoeven, K.J.F., Simonsen, K.L., McIntyre, L.M., 2005. Implementing false discovery rate control: increasing your power. Oikos 108, 643–647. https://doi.org/10.1111/j.0030-1299.2005.13727.x