Consider allowing JavaScript. Otherwise, you have to be proficient in reading LaTeX since formulas will not be rendered. Furthermore, the table of contents in the left column for navigation will not be available and code-folding not supported. Sorry for the inconvenience.

Examples in this article were generated with R 4.0.5 by the package PowerTOST.1

More examples are given in the respective vignette.2 See also the README on GitHub for an overview and the online manual3 for details and a collection of other articles.

  • The right-hand badges give the respective section’s ‘level’.
    
  1. Basics about sample size methodology – requiring no or only limited statistical expertise.
    
  1. These sections are the most important ones. They are – hopefully – easily comprehensible even for novices.
    
  1. A somewhat higher knowledge of statistics and/or R is required. May be skipped or reserved for a later reading.
    
  1. An advanced knowledge of statistics and/or R is required. Not recommended for beginners in particular.
    
  1. If you are not a neRd or statistics afficionado, skipping is recommended. Suggested for experts but might be confusing for others.
  • Click to show / hide R code.
Abbreviation Meaning
BE Bioequivalence
CV Total (pooled) Coefficient of Variation
CVinter Between-subject Coefficient of Variation
CVintra Within-subject Coefficient of Variation
H0 Null hypothesis
H1 Alternative hypothesis (also Ha)
NTID Narrow Therapeutic Index Drug
TOST Two One-Sided Tests

Introduction

    
What are the main statistical issues in planning a confirmatory experiment?

An ‘optimal’ study design is one which (taking all assumptions into account) has a reasonably high chance of demonstrating equivalence (power) whilst controlling the consumer risk.

top of section ↩︎

Preliminaries

    

A basic knowledge of R is required. To run the scripts at least version 1.4.3 (2016-11-01) of PowerTOST is suggested. Any version of R would likely do, though the current release of PowerTOST was only tested with version 3.6.3 (2020-02-29) and later.

Note that in all functions of PowerTOST the arguments (say, the assumed T/R-ratio theta0, the BE-limits (theta1, theta2), the assumed coefficient of variation CV, etc.) have to be given as ratios and not in percent.

sampleN.TOST() gives equally sized groups. Furthermore, the estimated sample size is the total number of subjects, which is always an even number (not subjects per group – like in some other software packages).

Most examples deal with studies where the response variables likely follow a lognormal distribution, i.e., we assume a multiplicative model (ratios instead of differences). We work with \(\small{\log_{e}}\)-transformed data in order to allow analysis by a variant of the t-test (requiring differences). This is the default in most functions of PowerTOST and hence, the argument logscale = TRUE does not need to be specified.

previous section ↩︎

Terminology

    

It may sound picky but ‘sample size calculation’ (as used in most guidelines and alas, in some publications and textbooks) is sloppy terminology. In order to get prospective power (and hence, a sample size), we need five values:

  1. The level of the test (in BE commonly 0.05),
  2. the BE-margins (commonly 0.80 – 1.25),
  3. the desired (or target) power,
  4. the variance (commonly expressed as a coefficient of variation), and
  5. the deviation of the test from the reference treatment.

1 – 2 are fixed by the agency,
3 is set by the sponsor (commonly to 0.80 – 0.90), and
4 – 5 are just (uncertain!) assumptions.

In other words, obtaining a sample size is not an exact calculation like \(\small{2\times2=4}\) but always just an estimation.

Power Calculation – A guess masquerading as mathematics.
Realization: Observations (in a sample) of a random variable (of the population).

Of note, it is extremely unlikely that all assumptions will be exactly realized in a particular study. Hence, calculating retrospective (a.k.a. post hoc, a posteriori) power is not only futile but plain nonsense.8

Contrary to non-inferiority trials (where we want to demonstrate superiority of verum to placebo) or first-in-human studies, in equivalence studies the allocation ratio is always 1 (i.e., we plan for equally sized treatment groups.)

We should not talk about between subject variability (CVinter) in parallel designs unless we have this information from a crossover design.9 What we see in a parallel design is the total variability (pooled from the between subject variability CVinter and the within subject variability CVintra).
The fact that we observed only one occasion does not mean that the within subject variability miraculously disappears.

Calculate the total CV for some combinations of CVintra (0.075 – 0.20) and CVinter (0.20 – 0.40).

CV2mse <- function(CV)  return(log(CV^2 + 1))
mse2CV <- function(mse) return(sqrt(exp(mse) - 1))
CV.b   <- seq(0.20, 0.40, 0.05)
CV.w   <- seq(0.075, 0.20, 0.025)
df     <- data.frame(CV.b = rep(CV.b,
                     each = length(CV.w)),
                     CV.w = CV.w, CV = NA)
for (i in 1:nrow(df)) {
  # mean of variances, converted back to CV
  df$CV[i] <- mse2CV(mean(c(CV2mse(df$CV.b[i]),
                            CV2mse(df$CV.w[i]))))
}
df     <- reshape(df, direction = "wide",
                  idvar = "CV.w", timevar = "CV.b")
names(df)[2:ncol(df)] <- paste0("CV.b.", CV.b)
print(signif(df, 5), row.names = FALSE)
R>   CV.w CV.b.0.2 CV.b.0.25 CV.b.0.3 CV.b.0.35 CV.b.0.4
R>  0.075  0.15056   0.18350  0.21671   0.24991  0.28294
R>  0.100  0.15777   0.18952  0.22190   0.25449  0.28706
R>  0.125  0.16655   0.19697  0.22838   0.26024  0.29226
R>  0.150  0.17667   0.20569  0.23603   0.26708  0.29847
R>  0.175  0.18789   0.21550  0.24474   0.27492  0.30562
R>  0.200  0.20000   0.22625  0.25437   0.28366  0.31363

In general CVintra < CVinter, explaining why crossover studies are so popular.

There is no relationship between CVintra and CVinter. An example are drugs which are subjected to polymorphic metabolism. For these drugs CVinterCVintra.

previous section ↩︎

Power → Sample size

    

The sample size cannot be directly estimated,
only power calculated for an already given sample size.

The power equations cannot be re-arranged to solve for sample size.

Power. That which statisticians are always calculating but never have.
Stephen Senn, Statistical Issues in Drug Development. Wiley, 2nd ed 2007
    

If you are a nerd out in the wild and have only your scientific calculator with you, use the large sample approximation (i.e., based on the normal distribution of \(\small{\log_{e}}\)-transformed data) taking power and \(\small{\alpha}\) into account.10

Convert the coefficient of variation to the variance \[\begin{equation}\tag{1} \sigma^2=\log_{e}(CV^2+1) \end{equation}\] Numbers needed in the following: \(\small{Z_{1-0.2}=0.8416212}\), \(\small{Z_{1-0.1}=1.281552}\), \(\small{Z_{1-0.05}=1.644854}\). \[\begin{equation}\tag{2a} \textrm{if}\;\theta_0<1:n\sim\frac{2\,\sigma^2(Z_{1-\beta}-Z_{1-\alpha})^2}{\left(\log_{e}(\theta_0)-\log_{e}(\theta_1)\right)^2} \end{equation}\] \[\begin{equation}\tag{2b} \textrm{if}\;\theta_0>1:n\sim\frac{2\,\sigma^2(Z_{1-\beta}-Z_{1-\alpha})^2}{\left(\log_{e}(\theta_0)-\log_{e}(\theta_2)\right)^2} \end{equation}\] \[\begin{equation}\tag{2c} \textrm{if}\;\theta_0=1:n\sim\frac{2\,\sigma^2(Z_{1-\beta/2}-Z_{1-\alpha})^2}{\left(\log_{e}(\theta_2)\right)^2} \end{equation}\] Generally you should use \(\small{(2\textrm{a},2\textrm{b})}\). Only if you are courageous, use \(\small{(2\textrm{c})}\). However, I would not recommend that.11

\(\small{(2)}\) gives the sample size (it can be a real number) per group. We have to multiply it by two and round up to the next even integer.

In \(\small{(1-2)}\) we assume that the population variance \(\small{\sigma^2}\) is known, which is never the case. In practice is is uncertain because only its sample estimate \(\small{s^2}\) is known. Therefore, \(\small{(1-2)}\) will always give a sample size which is too low:
For \(\small{CV=0.40}\), \(\small{\theta_0=0.95}\), \(\small{\alpha=0.05}\), \(\small{\beta=0.20}\), and the conventional limits we get 124.29 for the total sample size (rounded up to 126). As we will see later, the correct sample size is 130.

Power: That which is wielded by the priesthood of clinical trials, the statisticians, and a stick which they use to beta their colleagues.
Stephen Senn, Statistical Issues in Drug Development. Wiley, 2004.
    

Since we evaluate studies with a variant of the t-test, we have to take the degrees of freedom (in parallel designs \(\small{df=n-2)}\) into account.
Power depends on the degrees of freedom, which themselves depend on the sample size.

    

Hence, the power equations are used to estimate the sample size in an iterative way.

  • Start with an approximate sample size (from (2) or with a ‘guestimate’),
  • calculate its power (see the methods section), and
    • if it is < our target, increase the sample size until ≥ the target power is reached → stop;
    • if it is > our target, decrease the sample size until we drop below the target power; use the last sample size with power ≥ the target → stop.

Example based on the noncentral t-distribution (warning: 100 lines of code).

make.even <- function(n) { # equal group sizes
  return(as.integer(2 * (n %/% 2 + as.logical(n %% 2))))
}
CV         <- 0.40       # total (pooled) CV
theta0     <- 0.95       # T/R-ratio
theta1     <- 0.80       # lower BE-limit
theta2     <- 1.25       # upper BE-limit
target     <- 0.80       # desired (target) power
alpha      <- 0.05       # level of the test
beta       <- 1 - target # producer's risk
df         <- data.frame(method = character(),
                         iteration = integer(),
                         n = integer(),
                         power = numeric())
s2         <- log(CV^2 + 1)
s          <- sqrt(s2)
Z.alpha    <- qnorm(1 - alpha)
Z.beta     <- qnorm(1 - beta)
Z.beta.2   <- qnorm(1 - beta / 2)
# large sample approximation
if (theta0 == 1) {
  num   <- 2 * s2 * (Z.beta.2 + Z.alpha)^2
  n.grp <- ceiling(num / log(theta2)^2)
} else {
  num <- 2 * s2 * (Z.beta + Z.alpha)^2
  if (theta0 < 1) {
    denom <- (log(theta0) - log(theta1))^2
  } else {
    denom <- (log(theta0) - log(theta2))^2
  }
  n.grp <- ceiling(num / denom)
}
n     <- make.even(2 * n.grp)
power <- pnorm(sqrt((log(theta0) - log(theta2))^2 * n.grp/
              (2 * s2)) - Z.alpha) +
         pnorm(sqrt((log(theta0) - log(theta1))^2 * n.grp/
              (2 * s2)) - Z.alpha) - 1
df[1, 1:4] <- c("large sample", "\u2013",
                n, sprintf("%.6f", power))
# check by central t approximation
# (since the variance is unknown)
t.alpha <- qt(1 - alpha, n - 2)
if (theta0 == 1) {
  num   <- 2 * s2 * (Z.beta.2 + t.alpha)^2
  n.grp <- ceiling(num / log(theta2)^2)
} else {
  num <- 2 * s2 * (Z.beta + t.alpha)^2
  if (theta0 < 1) {
    denom <- (log(theta0) - log(theta1))^2
  } else {
    denom <- (log(theta0) - log(theta2))^2
  }
  n.grp <- ceiling(num / denom)
}
n       <- make.even(2 * n.grp)
t.alpha <- qt(1 - alpha, n - 2)
power <- pnorm(sqrt((log(theta0) - log(theta2))^2 * n.grp/
              (2 * s2)) - t.alpha) +
         pnorm(sqrt((log(theta0) - log(theta1))^2 * n.grp/
              (2 * s2)) - t.alpha) - 1
df[2, 1:4] <- c("central t", "\u2013",
                 n, sprintf("%.6f", power))
i <- 1
if (power < target) { # iterate upwards
  repeat {
    sem   <- sqrt(4 / n) * s
    ncp   <- c((log(theta0) - log(theta1)) / sem,
               (log(theta0) - log(theta2)) / sem)
    power <- diff(pt(c(+1, -1) * qt(1 - alpha,
                                    df = n - 2),
                     df = n - 2, ncp = ncp))
    df[i + 2, 1:4] <- c("noncentral t", i,
                         n, sprintf("%.6f", power))
    if (power >= target) {
      break
    } else {
      i <- i + 1
      n <- n + 2
    }
  }
} else {              # iterate downwards
  repeat {
    sem   <- sqrt(4 / n) * s
    ncp   <- c((log(theta0) - log(theta1)) / sem,
               (log(theta0) - log(theta2)) / sem)
    power <- diff(pt(c(+1, -1) * qt(1 - alpha,
                                    df = n - 2),
                     df = n - 2, ncp = ncp))
    df[i + 2, 1:4] <- c("noncentral t", i,
                        n, sprintf("%.6f", power))
    if (power < target) {
      df <- df[-nrow(df), ]
      break
    } else {
      i <- i + 1
      n <- n - 2
    }
  }
}
print(df, row.names = FALSE)
R>        method iteration   n    power
R>  large sample         – 126 0.795445
R>     central t         – 126 0.791696
R>  noncentral t         1 126 0.791080
R>  noncentral t         2 128 0.797396
R>  noncentral t         3 130 0.803512

    

Now it’s high time to leave Base R behind and continue with PowerTOST.
Makes our life much easier.

library(PowerTOST) # attach it to run the examples

Its sample size functions use a modification of Zhang’s method12 for the first guess. You can unveil the course of iterations with details = TRUE.

sampleN.TOST(CV = 0.40, theta0 = 0.95,
             targetpower = 0.80,
             design = "parallel", details = TRUE)
R> 
R> +++++++++++ Equivalence test - TOST +++++++++++
R>             Sample size estimation
R> -----------------------------------------------
R> Study design: 2 parallel groups 
R> Design characteristics:
R> df = n-2, design const. = 4, step = 2
R> 
R> log-transformed data (multiplicative model)
R> 
R> alpha = 0.05, target power = 0.8
R> BE margins = 0.8 ... 1.25 
R> True ratio = 0.95,  CV = 0.4
R> 
R> Sample size search (ntotal)
R>  n     power
R> 128   0.797396 
R> 130   0.803512 
R> 
R> Exact power calculation with
R> Owen's Q functions.

Only two iterations. In many cases it hits the bull’s eye right away, i.e., already with the first guess.

\(\small{100(1-2\,\alpha)}\) confidence interval of the point estimate within \(\small{\theta_1,\theta_2}\).

Never (ever!) estimate the sample size based on the large sample approximation \(\small{(2)}\).
We evaluate our studies based on a variant of the t-test, right?
The sample size based on the normal distribution may be too small, thus compromising power.

R>                     method   n Delta    RE    power
R>   large sample approx. (2) 126    -4 -3.1% 0.791080
R>  noncentral t-distribution 130    ±0 ±0.0% 0.803512
R>    exact method (Owen’s Q) 130     –   –   0.803512

previous section ↩︎

Examples

    

Throughout the examples by ‘group’ I’m referring to ‘treatment arms’ – not multiple groups within them or multicenter studies. That’s another story.

Most methods of PowerTOST are based on pairwise comparisons. It is up to you to adjust the level of the test alpha if you want to compare more (say, two test treatments v.s. a reference or each of them against one of the others) in order to avoid inflation of the family-wise error rate due to multiplicity. Examples are given further down.

A Simple Case

    

We assume a CV of 0.40, a T/R-ratio of 0.95, and target a power of 0.80.

Since theta0 = 0.95 and targetpower = 0.80 are defaults of the function, we don’t have to give them explicitely. As usual in bioequivalence, alpha = 0.05 is employed (we will assess the study by a \(\small{100(1-2\,\alpha)=90\%}\) confidence interval) and the BE-limits are theta1 = 0.80 and theta2 = 1.25. Since they are also defaults of the function, we don’t have to give them as well. Hence, the call of the function reduces to a one-liner.

sampleN.TOST(CV = 0.40, design = "parallel")
R> 
R> +++++++++++ Equivalence test - TOST +++++++++++
R>             Sample size estimation
R> -----------------------------------------------
R> Study design: 2 parallel groups 
R> log-transformed data (multiplicative model)
R> 
R> alpha = 0.05, target power = 0.8
R> BE margins = 0.8 ... 1.25 
R> True ratio = 0.95,  CV = 0.4
R> 
R> Sample size (total)
R>  n     power
R> 130   0.803512

Sometimes we are not interested in the entire output and want to use only a part of the results in subsequent calculations. We can suppress the output by the argument print = FALSE and assign the result to a data.frame (here df).

df <- sampleN.TOST(CV = 0.40, design = "parallel",
                   print = FALSE)

Although you could access the elements by the number of the column(s), I don’t recommend that since in various functions these numbers are different and hence, difficult to remember.

Let’s retrieve the column names of df:

names(df)
R> [1] "Design"         "alpha"          "CV"            
R> [4] "theta0"         "theta1"         "theta2"        
R> [7] "Sample size"    "Achieved power" "Target power"

Now we can access the elements of df by their names. Note that double square brackets [[…]] have to be used (single ones are used to access elements by their numbers).

df[["Sample size"]]
R> [1] 130
df[["Achieved power"]]
R> [1] 0.803512

If you insist in accessing elements by column-numbers, use single square brackets […].

df[7:8]
R>   Sample size Achieved power
R> 1         130       0.803512

With 130 subjects (65 per group) we achieve the power we desire.

What happens if we have one dropout?

power.TOST(CV = 0.40, design = "parallel",
           n = df[["Sample size"]] - 1)
R> Unbalanced design. n(i)=65/64 assumed.
R> [1] 0.8004552

Still above the 0.80 we desire. However, with two dropouts (128 eligible subjects) we would slightly fall short (0.7974).

Since dropouts are common (especially in studies in patients) it makes sense to include / dose more subjects in order to end up with a number of eligible subjects which is not lower than our initial estimate.

Let us explore that in the next section.

previous section ↩︎

Dropouts

    

We define two supportive functions:

  1. Provide equally sized groups, i.e., any total sample size n will be rounded up to the next even number.
up2even <- function(n) {
  return(as.integer(2 * (n %/% 2 + as.logical(n %% 2))))
}
  1. Provide the adjusted sample size based on the original sample size n, and the anticipated droput-rate do.rate.
nadj <- function(n, do.rate) {
  return(as.integer(up2even(n / (1 - do.rate))))
}

In order to come up with a suggestion we have to anticipate a (realistic!) dropout rate. Note that this not the job of the statistician; ask the Principal Investigator.

It is a capital mistake to theorise before one has data.

Dropout-rate

    

The dropout-rate is calculated from the eligible and dosed subjects
or simply \[\begin{equation}\tag{3} do.rate=1-n_\textrm{eligible}/n_\textrm{dosed} \end{equation}\] Of course, we know it only after the study was performed.

By substituting \(n_\textrm{eligible}\) with the estimated sample size \(n\), providing an anticipated dropout-rate and rearrangement to find the adjusted number of dosed subjects \(n_\textrm{adj}\) we should use \[\begin{equation}\tag{4} n_\textrm{adj}=\;\upharpoonleft n\,/\,(1-do.rate) \end{equation}\] where \(\upharpoonleft\) denotes rounding up to the next even number as implemented in the functions above.

An all too common mistake is to increase the estimated sample size \(n\) by the drop­out-rate according to \[\begin{equation}\tag{5} n_\textrm{adj}=\;\upharpoonleft n\times(1+do.rate) \end{equation}\]

If you used \((5)\) in the past – you are not alone. In a small survey a whooping 29% of respondents reported to use it.13 Consider changing your routine.

There are no routine statistical questions, only questionable statistical routines.

top of section ↩︎ previous section ↩︎

Adjusted Sample Size

    

In the following I specified more arguments to make the function more flexible.
Note that I wrapped the function power.TOST() in suppressMessages(). Otherwise, the function will throw for any odd sample size a message telling us that the design is unbalanced. Well, we know that.

design  <- "parallel"
CV      <- 0.40 # total (pooled) for parallel design
target  <- 0.80 # target (desired) power
theta0  <- 0.95 # assumed T/R-ratio
theta1  <- 0.80 # lower BE limit
theta2  <- 1.25 # upper BE limit
do.rate <- 0.10 # anticipated dropout-rate 10%
df      <- sampleN.TOST(CV = CV, theta0 = theta0,
                        theta1 = theta1,
                        theta2 = theta2,
                        targetpower = target,
                        design = design,
                        print = FALSE)
# calculate the adjusted sample size
n.adj   <- nadj(df[["Sample size"]], do.rate)
# (decreasing) vector of eligible subjects
n.elig  <- n.adj:df[["Sample size"]]
info    <- paste0("Design                  : ",
                  design,
                  "\nAssumed CV              : ",
                  CV,
                  "\nAssumed T/R ratio       : ",
                  theta0,
                  "\nBE limits               : ",
                  paste(theta1, theta2, sep = "\u2026"),
                  "\nTarget (desired) power  : ",
                  target,
                  "\nAnticipated dropout-rate: ",
                  do.rate,
                  "\nEstimated sample size   : ",
                  df[["Sample size"]], " (",
                  df[["Sample size"]]/2, "/group)",
                  "\nAchieved power          : ",
                  signif(df[["Achieved power"]], 4),
                  "\nAdjusted sample size    : ",
                  n.adj, " (",  n.adj/2, "/group)",
                  "\n\n")
# explore the potential outcome for
# an increasing number of dropouts
do.act <- signif((n.adj - n.elig) / n.adj, 4)
df     <- data.frame(dosed    = n.adj,
                     eligible = n.elig,
                     dropouts = n.adj - n.elig,
                     do.act   = do.act,
                     power    = NA)
for (i in 1:nrow(df)) {
  df$power[i] <- suppressMessages(
                   power.TOST(CV = CV,
                              theta0 = theta0,
                              theta1 = theta1,
                              theta2 = theta2,
                              design = design,
                              n = df$eligible[i]))
}
cat(info); print(round(df, 4), row.names = FALSE)
R> Design                  : parallel
R> Assumed CV              : 0.4
R> Assumed T/R ratio       : 0.95
R> BE limits               : 0.8…1.25
R> Target (desired) power  : 0.8
R> Anticipated dropout-rate: 0.1
R> Estimated sample size   : 130 (65/group)
R> Achieved power          : 0.8035
R> Adjusted sample size    : 146 (73/group)
R>  dosed eligible dropouts do.act  power
R>    146      146        0 0.0000 0.8461
R>    146      145        1 0.0068 0.8437
R>    146      144        2 0.0137 0.8413
R>    146      143        3 0.0205 0.8389
R>    146      142        4 0.0274 0.8364
R>    146      141        5 0.0343 0.8339
R>    146      140        6 0.0411 0.8313
R>    146      139        7 0.0480 0.8287
R>    146      138        8 0.0548 0.8261
R>    146      137        9 0.0616 0.8234
R>    146      136       10 0.0685 0.8207
R>    146      135       11 0.0753 0.8180
R>    146      134       12 0.0822 0.8152
R>    146      133       13 0.0890 0.8123
R>    146      132       14 0.0959 0.8094
R>    146      131       15 0.1027 0.8065
R>    146      130       16 0.1096 0.8035

In the worst case (16 dropouts) we end up with the originally estimated sample size of 130. Power preserved, mission accomplished. If we have less dropouts, splendid – we gain power.

If we would have adjusted the sample size acc. to (5) we would have dosed only 144 subjects instead of 146 acc. to (4).
If the anticipated dropout rate of 10% is realized in the study, we would have only 129 eligible subjects and power will be slightly compromised (0.8005 instead of 0.8035). In this example we achieve still more than our target power but the loss might be relevant in other cases.

top of section ↩︎ previous section ↩︎

Post hoc Power

    

As said in the preliminaries, calculating post hoc power is futile.

There is simple intuition behind results like these: If my car made it to the top of the hill, then it is powerful enough to climb that hill; if it didn’t, then it obviously isn’t powerful enough. Retrospective power is an obvious answer to a rather uninteresting question. A more meaningful question is to ask whether the car is powerful enough to climb a particular hill never climbed before; or whether a different car can climb that new hill. Such questions are prospective, not retrospective.

However, sometimes we are interested in it for planning the next study.

If you give and odd total sample size n, power.TOST() will try to keep groups as balanced as possible and show in a message how that was done.

n.act <- 131
signif(power.TOST(CV = 0.40,  design = "parallel",
                  n = n.act), 6)
R> Unbalanced design. n(i)=66/65 assumed.
R> [1] 0.806475

Say, our study was more unbalanced. Let us assume that we dosed 146 subjects, the total number of subjects was also 131 but all drop­outs occured in one group (unlikely but possible).
Instead of the total sample size n we can give the number of subjects of each group as a vector (the order is not relevant, i.e., it does not matter which element refers to the test or reference group).

n.adj    <- 146
n.act    <- 131
n.g1     <- n.adj / 2
n.g2     <- n.act - n.g1
post.hoc <- signif(power.TOST(CV  = 0.40, design = "parallel",
                             n = c(n.g1, n.g2)), 6)
fmt      <- "%3.0f (%2.0f dropouts)"
cat(paste0("Dosed subjects: ", n.adj,
           "\nEligible      : ",
           sprintf(fmt, n.act, n.adj - n.act),
           "\n    Group 1   : ",
           sprintf(fmt, n.g1, n.adj / 2 - n.g1),
           "\n    Group 2   : ",
           sprintf(fmt, n.g2, n.adj / 2 - n.g2),
           "\nPower         : ", post.hoc, "\n"))
R> Dosed subjects: 146
R> Eligible      : 131 (15 dropouts)
R>     Group 1   :  73 ( 0 dropouts)
R>     Group 2   :  58 (15 dropouts)
R> Power         : 0.801396

Of course, in a particular study you will provide the numbers in the n vector directly.

top of section ↩︎ previous section ↩︎

Lost in Assumptions

    

As stated already in the introduction, the CV and the T/R-ratio are only assumptions. Whatever their origin might be (literature, previous studies) they carry some degree of uncertainty. Hence, believing14 that they are the true ones may be risky.
Some call that the ‘Carved-in-Stone’ approach.

Say, we performed a pilot study in 64 subjects and estimated the CV as 0.40.

The \(\alpha\) confidence interval of the CV is obtained via the \(\chi^2\)-distribution of its error variance \(\sigma^2\) with \(\small{n-2}\) degrees of freedom. \[\begin{matrix}\tag{6} s^2=\log_{e}(CV^2+1)\\ L=\frac{(n-1)\,s^2}{\chi_{\alpha/2,\,n-2}^{2}}\leq\sigma^2\leq\frac{(n-1)\,s^2}{\chi_{1-\alpha/2,\,n-2}^{2}}=U\\ \left\{lower\;CL,\;upper\;CL\right\}=\left\{\sqrt{\exp(L)-1},\sqrt{\exp(U)-1}\right\} \end{matrix}\]

Let’s calculate the 95% confidence interval of the CV to get an idea.

m  <- 64 # pilot study
df <- CVCL(CV = 0.40, df = m - 2,
           side = "2-sided", alpha = 0.05)
signif(df, 4)
R> lower CL upper CL 
R>   0.3368   0.4941

Surprised? Although 0.40 is the best estimate for planning the next study, there is no guarantee that we will get exactly the same outcome. Since the \(\chi^2\)-distribution is skewed to the right, it is more likely to get a higher CV than a lower one in the planned study.

If we plan the study based on 0.40, we would opt for 130 subjects like in the examples before (not adjusted for the dropout-rate).
If the CV will be lower, we gain power. Fine.
But what if it will be higher? Of course, we will loose power. But how much?

Let’s explore what might happen at the confidence limits of the CV.

m    <- 64
df   <- CVCL(CV = 0.40, df = m - 2,
             side = "2-sided", alpha = 0.05)
n    <- 130
comp <- data.frame(CV = c(df[["lower CL"]], 0.40,
                          df[["upper CL"]]),
                   power = NA)
for (i in 1:nrow(comp)) {
  comp$power[i] <- power.TOST(CV = comp$CV[i],
                              design = "parallel",
                              n = n)
}
comp[, 1] <- signif(comp[, 1], 4)
comp[, 2] <- signif(comp[, 2], 6)
print(comp, row.names = FALSE)
R>      CV    power
R>  0.3368 0.906998
R>  0.4000 0.803512
R>  0.4941 0.624098

Ouch!

What can we do? The larger the previous study was, the larger the degrees of freedom and hence, the narrower the confidence interval. In simple terms: The estimate is more certain. On the other hand, it also means that very small pilot studies are practically useless.

m               <- seq(8, 64, 8)
df              <- data.frame(n = m, CV = 0.40,
                              l = NA, u = NA)
for (i in 1:nrow(df)) {
  df[i, 3:4] <- CVCL(CV = 0.40, df = m[i] - 2,
                     side = "2-sided",
                     alpha = 0.05)
}
df[, 3:4]      <- signif(df[, 3:4], 4)
names(df)[3:4] <- c("lower CL", "upper CL")
print(df, row.names = FALSE)
R>   n  CV lower CL upper CL
R>   8 0.4   0.2521   1.0270
R>  16 0.4   0.2878   0.6682
R>  24 0.4   0.3047   0.5884
R>  32 0.4   0.3153   0.5511
R>  40 0.4   0.3228   0.5287
R>  48 0.4   0.3285   0.5136
R>  56 0.4   0.3330   0.5026
R>  64 0.4   0.3368   0.4941
    

A Bayesian method is implemented in PowerTOST, which takes the uncertainty of estimates into account. We can explore the uncertainty of the CV, the T/R-ratio, and both.

m      <- 64   # sample size of pilot
CV     <- 0.40
theta0 <- 0.95
design <- "parallel"
df     <- expsampleN.TOST(CV = 0.4, theta0 = theta0,
                          targetpower = 0.80,
                          design = design,
                          prior.parm = list(
                            m = m, design = design),
                          prior.type = "CV",
                          details = FALSE)
R> 
R> ++++++++++++ Equivalence test - TOST ++++++++++++
R>        Sample size est. with uncertain CV
R> -------------------------------------------------
R> Study design:  2 parallel groups 
R> log-transformed data (multiplicative model)
R> 
R> alpha = 0.05, target power = 0.8
R> BE margins = 0.8 ... 1.25 
R> Ratio = 0.95
R> CV = 0.4 with 62 df
R> 
R> Sample size (ntotal)
R>  n   exp. power
R> 134   0.802364

Doesn’t really matter. Only 3% more subjects required.

What about an uncertain T/R-ratio?

m      <- 64
CV     <- 0.40
theta0 <- 0.95
design <- "parallel"
df     <- expsampleN.TOST(CV = 0.4, theta0 = theta0,
                          targetpower = 0.80,
                          design = design,
                          prior.parm = list(
                            m = m, design = design),
                          prior.type = "theta0",
                          details = FALSE)
R> 
R> ++++++++++++ Equivalence test - TOST ++++++++++++
R>      Sample size est. with uncertain theta0
R> -------------------------------------------------
R> Study design:  2 parallel groups 
R> log-transformed data (multiplicative model)
R> 
R> alpha = 0.05, target power = 0.8
R> BE margins = 0.8 ... 1.25 
R> Ratio = 0.95
R> CV = 0.4
R> 
R> Sample size (ntotal)
R>  n   exp. power
R> 314   0.800778

Just terrible. The sample size more than doubles.

That should not take us by surprise. We don’t know where the true T/R-ratio lies but we can calculate the lower 95% confidence limit of the pilot study’s point estimate to get an idea about a worst case.

m      <- 64
CV     <- 0.40
pe     <- 0.95
design <- "parallel"
ci     <- round(CI.BE(CV = CV, pe = 0.95, n = m,
                      design = design), 4)
if (pe <= 1) {
  cl <- ci[["lower"]]
} else {
  cl <- ci[["upper"]]
}
print(cl)
R> [1] 0.8089

Explore the impact of a relatively 5% higher CV and a relatively 5% lower T/R-ratio on power for a given sample size.

n      <- 130
CV     <- 0.40
theta0 <- 0.95
design <- "parallel"
comp1  <- data.frame(CV = c(CV, CV*1.05),
                     power = NA)
comp2  <- data.frame(theta0 = c(theta0, theta0*0.95),
                     power = NA)
for (i in 1:2) {
  comp1$power[i] <- power.TOST(CV = comp1$CV[i],
                               theta0 = theta0,
                               n = n,
                               design = design)
}
comp1$power <- signif(comp1$power, 6)
for (i in 1:2) {
  comp2$power[i] <- power.TOST(CV = CV,
                               theta0 = comp2$theta0[i],
                               n = n,
                               design = design)
}
comp2$power <- signif(comp2$power, 6)
print(comp1, row.names = F); print(comp2, row.names = F)
R>    CV    power
R>  0.40 0.803512
R>  0.42 0.766869
R>  theta0    power
R>  0.9500 0.803512
R>  0.9025 0.550772
    

Interlude 1

Fig. 1 Power curves for n = 130.

Note the log-scale of the x-axis. It demonstrates that power curves are symmetrical around 1 (\(\small{\log_{e}(1)=0}\), where \(\small{\log_{e}(\theta_2)=\left|\log_{e}(\theta_1)\right|}\)) and we will achieve the same power for \(\small{\theta_0}\) and \(\small{1/\theta_0}\) (e.g., for 0.95 and 1.0526). Furthermore, all curves intersect 0.05 at \(\small{\theta_1}\) and \(\small{\theta_2}\), which means that for a true \(\small{\theta_0}\) at one of the limits \(\small{\alpha}\) is strictly controlled.

<nitpick>
  • A common flaw in protocols is the phrase
          »The sample size was calculated [sic] based on a T/R-ratio of 0.95 – 1.05…«
    If you assume a deviation of 5% of the test from the reference and are not sure about its direction (lower or higer than 1), always use the lower T/R-ratio. If you would use the upper T/R-ratio, power would be only preserved down to 1/1.05 = 0.9524.
    Given, sometimes you will need a higher sample size with the lower T/R-ratio. There’s no free lunch.
CV      <- 0.40
delta   <- 0.05 # direction unknown
theta0s <- c(1 - delta, 1 / (1 + delta),
             1 + delta, 1 / (1 - delta))
n       <- sampleN.TOST(CV = CV, theta0 = 1 - delta,
                        design = "parallel",
                        print = FALSE)[["Sample size"]]
comp1   <- data.frame(CV = CV, theta0 = theta0s,
                      base = c(TRUE, rep(FALSE, 3)),
                      n = n, power = NA)
for (i in 1:nrow(comp1)) {
  comp1$power[i] <- power.TOST(CV = CV,
                               theta0 = comp1$theta0[i],
                               design = "parallel",
                               n = n)
}
n       <- sampleN.TOST(CV = CV, theta0 = 1 + delta,
                        design = "parallel",
                        print = FALSE)[["Sample size"]]
comp2   <- data.frame(CV = CV, theta0 = theta0s,
                      base = c(FALSE, FALSE, TRUE, FALSE),
                      n = n, power = NA)
for (i in 1:nrow(comp2)) {
  comp2$power[i] <- power.TOST(CV = CV,
                               theta0 = comp2$theta0[i],
                               design = "parallel",
                               n = n)
}
comp1[, c(2, 5)] <- signif(comp1[, c(2, 5)] , 4)
comp2[, c(2, 5)] <- signif(comp2[, c(2, 5)] , 4)
print(comp1, row.names = F); print(comp2, row.names = F)
R>   CV theta0  base   n  power
R>  0.4 0.9500  TRUE 130 0.8035
R>  0.4 0.9524 FALSE 130 0.8124
R>  0.4 1.0500 FALSE 130 0.8124
R>  0.4 1.0530 FALSE 130 0.8035
R>   CV theta0  base   n  power
R>  0.4 0.9500 FALSE 126 0.7911
R>  0.4 0.9524 FALSE 126 0.8001
R>  0.4 1.0500  TRUE 126 0.8001
R>  0.4 1.0530 FALSE 126 0.7911

</nitpick>

    

Since power is much more sensitive to the T/R-ratio than to the CV, the results obtained with the Bayesian method should be clear now.

Essentially this leads to the murky waters of prospective sensitivity analyses, which will be covered in another article.
An appetizer to show the maximum deviations (CV, T/R-ratio and decreased sample size due to dropouts) which give still a minimum acceptable power of ≥ 0.70:

CV       <- 0.40
theta0   <- 0.95
design   <- "parallel"
target   <- 0.80
minpower <- 0.70
pa       <- pa.ABE(CV = CV, theta0 = theta0,
                   design = design,
                   targetpower = target,
                   minpower = minpower)
change.CV     <- 100*(tail(pa$paCV[["CV"]], 1) -
                      pa$plan$CV) / pa$plan$CV
change.theta0 <- 100*(head(pa$paGMR$theta0, 1) -
                      pa$plan$theta0) /
                      pa$plan$theta0
change.n      <- 100*(tail(pa$paN[["N"]], 1) -
                      pa$plan[["Sample size"]]) /
                      pa$plan[["Sample size"]]
comp     <- data.frame(parameter = c("CV", "theta0", "n"),
                       change = c(change.CV,
                                  change.theta0,
                                  change.n))
comp$change <- sprintf("%+.2f%%", comp$change)
names(comp)[2] <- "relative change"
print(pa, plotit = FALSE); print(comp, row.names = FALSE)
R> Sample size plan ABE
R>    Design alpha  CV theta0 theta1 theta2 Sample size
R>  parallel  0.05 0.4   0.95    0.8   1.25         130
R>  Achieved power
R>        0.803512
R> 
R> Power analysis
R> CV, theta0 and number of subjects leading to min. acceptable power of =0.7:
R>  CV= 0.4552, theta0= 0.9276
R>  n = 103 (power= 0.701)
R>  parameter relative change
R>         CV         +13.79%
R>     theta0          -2.36%
R>          n         -20.77%

Confirms what we have seen above. Interesting that the sample size is the least sensitive one. Many overrate the impact of dropouts on power.


If you didn’t stop reading in desperation (understandable), explore both uncertain CV and T/R-ratio:

m      <- 64
CV     <- 0.40
theta0 <- 0.95
design <- "parallel"
df     <- expsampleN.TOST(CV = 0.4, theta0 = theta0,
                          targetpower = 0.80,
                          design = design,
                          prior.parm = list(
                            m = m, design = design),
                          prior.type = "both",
                          details = FALSE)
R> 
R> ++++++++++++ Equivalence test - TOST ++++++++++++
R>   Sample size est. with uncertain CV and theta0
R> -------------------------------------------------
R> Study design:  2 parallel groups 
R> log-transformed data (multiplicative model)
R> 
R> alpha = 0.05, target power = 0.8
R> BE margins = 0.8 ... 1.25 
R> Ratio = 0.95 with 62 df
R> CV = 0.4 with 62 df
R> 
R> Sample size (ntotal)
R>  n   exp. power
R> 332   0.800366

I don’t suggest that you should use it in practice. AFAIK, not even the main author of this function (Benjamin Lang) does. However, it serves educational purposes to show that it is not that easy and why even properly planned studies might fail.

    

An alternative to assuming a T/R-ratio is based on statistical assurance.15 This concept uses the distribution of T/R-ratios and assumes an uncertainty parameter \(\small{\sigma_\textrm{u}}\). A natural assumption is \(\small{\sigma_\textrm{u}=1-\theta_0}\), i.e., for the commonly applied T/R-ratio of 0.95 one can use the argument sem = 0.05 in the function expsampleN.TOST(), where the argument theta0 has to be fixed at 1.

CV        <- 0.40
theta0    <- 0.95
target    <- 0.80
design    <- "parallel"
sigma.u   <- 1 - theta0
comp      <- data.frame(theta0 = theta0,
                        n.1 = NA, power = NA,
                        sigma.u = sigma.u,
                        n.2 = NA, assurance = NA)
comp[2:3] <- sampleN.TOST(CV = CV,
                          targetpower = target,
                          theta0 = theta0,
                          design = design,
                          details = FALSE,
                          print = FALSE)[7:8]
comp[5:6] <- expsampleN.TOST(CV = CV,
                             theta0 = 1, # fixed!
                             targetpower = target,
                             design = design,
                             prior.type = "theta0",
                             prior.parm =
                               list(sem = sigma.u),
                             details = FALSE,
                             print = FALSE)[9:10]
names(comp)[c(2, 5)]  <- rep("n", 2)
print(signif(comp, 6), row.names = FALSE)
R>  theta0   n    power sigma.u   n assurance
R>    0.95 130 0.803512    0.05 126   0.80161

previous section ↩︎

Multiple Endpoints

    

It is not unusal that equivalence of more than one endpoint has to be demonstrated. In bioequivalence the pharmacokinetic metrics Cmax and AUC0–t are mandatory (in some jurisdictions like the FDA additionally AUC0–∞).

We don’t have to worry about multiplicity issues (inflated Type I Error) since if all tests must pass at level \(\alpha\), we are protected by the intersection-union principle.16 17

We design the study always for the worst case combination, i.e., based on the PK metric requiring the largest sample size. In some jurisdictions wider BE limits for Cmax are acceptable. Let’s explore that with different CVs and T/R-ratios.

metrics <- c("Cmax", "AUCt", "AUCinf")
CV      <- c(0.40, 0.30, 0.33)
theta0  <- c(0.95, 1.07, 1.06)
theta1  <- c(0.75, 0.80, 0.80)
theta2  <- 1 / theta1
design  <- "parallel"
target  <- 0.80
df      <- data.frame(metric = metrics,
                      theta1 = theta1, theta2 = theta2,
                      CV = CV, theta0 = theta0, n = NA)
for (i in 1:nrow(df)) {
  df$n[i] <- sampleN.TOST(CV = CV[i],
                          theta0 = theta0[i],
                          theta1 = theta1[i],
                          theta2 = theta2[i],
                          targetpower = target,
                          design = design,
                          print = FALSE)[["Sample size"]]
}
df$theta1 <- sprintf("%.4f", df$theta1)
df$theta2 <- sprintf("%.4f", df$theta2)
txt       <- paste0("Sample size based on ",
                     df$metric[df$n == max(df$n)], ".\n")
print(df, row.names = FALSE); cat(txt)
R>  metric theta1 theta2   CV theta0  n
R>    Cmax 0.7500 1.3333 0.40   0.95 72
R>    AUCt 0.8000 1.2500 0.30   1.07 90
R>  AUCinf 0.8000 1.2500 0.33   1.06 98
R> Sample size based on AUCinf.

Even if we assume the same T/R-ratio for two PK metrics, we will get a wider margin for the one with lower variability.

Let’s continue with the conditions of our previous examples, this time assuming that the CV and T/R-ratio were applicable for Cmax. As common in PK, the CV of AUC is lower, say only 0.30. That means, the study is ‘overpowered’ for the assumed T/R-ratio of AUC.

    

Which are the extreme T/R-ratios (largest deviations of T from R) giving still the target power?

opt <- function(x) {
  power.TOST(theta0 = x, CV = df$CV[2],
             theta1 = theta1,
             theta2 = theta2,
             n = df$n[1],
             design = design) - target
}
metrics <- c("Cmax", "AUC")
CV      <- c(0.40, 0.30) # Cmax, AUC
theta0  <- 0.95          # both metrics
theta1  <- 0.80
theta2  <- 1.25
design  <- "parallel"
target  <- 0.80
df      <- data.frame(metric = metrics, theta0 = theta0,
                      CV = CV, n = NA, power = NA)
for (i in 1:nrow(df)) {
  df[i, 4:5] <- sampleN.TOST(CV = CV[i],
                             theta0 = theta0,
                             theta1 = theta1,
                             theta2 = theta2,
                             targetpower = target,
                             design = design,
                             print = FALSE)[7:8]
}
df$power <- signif(df$power, 6)
if (theta0 < 1) {
  res <- uniroot(opt, tol = 1e-8,
                 interval = c(theta1 + 1e-4, theta0))
} else {
  res <- uniroot(opt, tol = 1e-8,
                 interval = c(theta0, theta2 - 1e-4))
}
res     <- unlist(res)
theta0s <- c(res[["root"]], 1/res[["root"]])
txt     <- paste0("Target power for ", metrics[2],
                 " and sample size ",
                 df$n[1], "\nachieved for theta0 ",
                 sprintf("%.4f", theta0s[1]), " or ",
                 sprintf("%.4f", theta0s[2]), ".\n")
print(df, row.names = FALSE); cat(txt)
R>  metric theta0  CV   n    power
R>    Cmax   0.95 0.4 130 0.803512
R>     AUC   0.95 0.3  76 0.803123
R> Target power for AUC and sample size 130
R> achieved for theta0 0.9099 or 1.0990.

That means, although we assumed for AUC the same T/R-ratio as for Cmax – with the sample size of 130 required for Cmax – for AUC it can be as low as ~0.91 or as high as ~1.10, which is a soothing side-effect.

\(\small{\lambda_\textrm{z}}\): Apparent terminal elimination rate – con­stant.

Furthermore, sometimes we have less data of AUC than of Cmax (samples at the end of the profile missing or unreliable estimation of \(\small{\lambda_\textrm{z}}\) in some subjects and therefore, less data of AUC0–∞ than of AUC0–t). Again, it will not hurt because for the originally assumed T/R-ratio we need only 76 subjects.

    

Since – as a one-point metric – Cmax is inherently more variable than AUC, Health Canada does not require assessment of its confidence interval – only the point estimate has to lie within 80.0–125.0%. We can explore that by setting alpha = 0.5 for it.18

metrics <- c("Cmax", "AUC")
CV      <- c(1.00, 0.40)
theta0  <- 0.95
design  <- "parallel"
target  <- 0.80
alpha   <- c(0.50, 0.05)
df      <- data.frame(metric = metrics, CV = CV, theta0 = theta0,
                      alpha = alpha, n = NA)
for (i in 1:nrow(df)) {
  df$n[i] <- sampleN.TOST(alpha = alpha[i],
                          CV = CV[i],
                          theta0 = theta0,
                          targetpower = target,
                          design = design,
                          print = FALSE)[["Sample size"]]
}
txt       <- paste0("Sample size based on ",
                     df$metric[df$n == max(df$n)], ".\n")
print(df, row.names = FALSE); cat(txt)
R>  metric  CV theta0 alpha   n
R>    Cmax 1.0   0.95  0.50 102
R>     AUC 0.4   0.95  0.05 130
R> Sample size based on AUC.

Note the extremely high CV. You will hardly ever face a situation where you have to base the sample size on Cmax.

previous section ↩︎

Multiple Treatments

    

In the most simple case one may compare two test treatments to one reference and want to demonstrate equivalence of both. The well-known Bon­ferroni-adjustment can be used \[\begin{equation}\tag{6} \alpha_\textrm{adj}=\alpha\,/\,k \end{equation}\] where \(\small{k}\) is the number of simultaneous tests. The type I error \(\small{TIE}\) will always be controlled because \[\begin{equation}\tag{7} TIE=1-{(1-\alpha_\textrm{adj})}^k \end{equation}\] For \(\small{\alpha=0.05}\) and \(\small{k=2}\) we get \(\small{0.049375<\alpha}\).

Of course, all comparisons in the study should be done by \(\small{100(1-2\,\alpha_\textrm{adj})}\) confidence intervals.

Extending the basic example and using the argument alpha:

k         <- 2        # comparisons
alpha.adj <- 0.05 / k # Bonferroni
sampleN.TOST(alpha = alpha.adj, CV = 0.40, theta0 = 0.95,
             targetpower = 0.80, design = "parallel")
R> 
R> +++++++++++ Equivalence test - TOST +++++++++++
R>             Sample size estimation
R> -----------------------------------------------
R> Study design: 2 parallel groups 
R> log-transformed data (multiplicative model)
R> 
R> alpha = 0.025, target power = 0.8
R> BE margins = 0.8 ... 1.25 
R> True ratio = 0.95,  CV = 0.4
R> 
R> Sample size (total)
R>  n     power
R> 162   0.800136

The sample size increased by ~25% from the 130 subjects required in a single comparison.

If all treatments should be tested for equivalence (say, T1 v.s. R, T2 v.s. R, and T2 v.s. T1), naturally the sample size increases further.

k         <- 3        # comparisons
alpha.adj <- 0.05 / k # Bonferroni
sampleN.TOST(alpha = alpha.adj, CV = 0.40, theta0 = 0.95,
             targetpower = 0.80, design = "parallel")
R> 
R> +++++++++++ Equivalence test - TOST +++++++++++
R>             Sample size estimation
R> -----------------------------------------------
R> Study design: 2 parallel groups 
R> log-transformed data (multiplicative model)
R> 
R> alpha = 0.01666667, target power = 0.8
R> BE margins = 0.8 ... 1.25 
R> True ratio = 0.95,  CV = 0.4
R> 
R> Sample size (total)
R>  n     power
R> 182   0.801565
    

With an increasing number of comparisons, the Bon­ferroni-adjustment quickly becomes overly conservative. Another issue is that the tests are not strictly independent (at least when we compare different tests to the same reference treatment some – unknown – correlation exists). More powerful methods (Holm, Hochberg, …) are out of scope of this article.

previous section ↩︎

NTIDs

    

So far we employed the common (and hence, default) BE-limits theta1 = 0.80 and theta2 = 1.25. In some jurisdictions tighter limits of 90.00 – 111.11% have to be used.19 What if we have to deal with a NTID in oncology (a study in patients)?

You have to provide only the lower BE-limit theta1 (the upper one will be automatically calculated).

sampleN.TOST(CV = 0.40, design = "parallel",
             theta1 = 0.90)
R> 
R> +++++++++++ Equivalence test - TOST +++++++++++
R>             Sample size estimation
R> -----------------------------------------------
R> Study design: 2 parallel groups 
R> log-transformed data (multiplicative model)
R> 
R> alpha = 0.05, target power = 0.8
R> BE margins = 0.9 ... 1.111111 
R> True ratio = 0.95,  CV = 0.4
R> 
R> Sample size (total)
R>  n     power
R> 1258   0.800289

Terrible.

Interlude 2

Health Canada requires for NTIDs (termed by HC ‘critical dose drugs’) that the confidence interval of Cmax lies within 80.0–125.0% and the one of AUC within 90.0–112.0%.

<nitpick>
  • 100(1/0.9) ≠ 112.0, right? I always thought that it’s 111.11…
    On the average generic products will be approved with 100√0.9×1.12 ≈ 100.4%
    When asked, the reply was: »These numbers are more easy to remember.«

</nitpick>

Hence, for Health Canada you have to set both limits.

sampleN.TOST(CV = 0.40, design = "parallel",
             theta1 = 0.90, theta2 = 1.12)
R> 
R> +++++++++++ Equivalence test - TOST +++++++++++
R>             Sample size estimation
R> -----------------------------------------------
R> Study design: 2 parallel groups 
R> log-transformed data (multiplicative model)
R> 
R> alpha = 0.05, target power = 0.8
R> BE margins = 0.9 ... 1.12 
R> True ratio = 0.95,  CV = 0.4
R> 
R> Sample size (total)
R>  n     power
R> 1258   0.800289

However, only rarely a bit lower than with the common limits for NTIDs (here we get the same sample size).

The FDA requires for NTIDs tighter batch-release specifications (±5% instead of ±10%). Let’s hope that your product complies and the T/R-ratio will be closer to 1:

sampleN.TOST(CV = 0.40, design = "parallel",
             theta0 = 0.975, # ‘better’ T/R-ratio
             theta1 = 0.90)
R> 
R> +++++++++++ Equivalence test - TOST +++++++++++
R>             Sample size estimation
R> -----------------------------------------------
R> Study design: 2 parallel groups 
R> log-transformed data (multiplicative model)
R> 
R> alpha = 0.05, target power = 0.8
R> BE margins = 0.9 ... 1.111111 
R> True ratio = 0.975,  CV = 0.4
R> 
R> Sample size (total)
R>  n     power
R> 588   0.801308

Substantially lower sample size but still not nice.

Luckily most NTIDs do not show a large variability (specifically within subjects). However, this does not help much because in a parallel design the total CV consists of both within and between subject components. Let’s be optimistic and assume a lower CV.

sampleN.TOST(CV = 0.30, design = "parallel",
             theta0 = 0.975,
             theta1 = 0.90)
R> 
R> +++++++++++ Equivalence test - TOST +++++++++++
R>             Sample size estimation
R> -----------------------------------------------
R> Study design: 2 parallel groups 
R> log-transformed data (multiplicative model)
R> 
R> alpha = 0.05, target power = 0.8
R> BE margins = 0.9 ... 1.111111 
R> True ratio = 0.975,  CV = 0.3
R> 
R> Sample size (total)
R>  n     power
R> 342   0.801324

Well… Without a reasonably large pilot study (were we estimate the CV) such an assumption would be pretty risky – it may fire back. What if we performed the study in 342 patients and our assumption was too optimistic (say, the CV turned out to be 0.35)?

power.TOST(CV = 0.35, design = "parallel",
           theta0 = 0.975,
           theta1 = 0.90,
           n = 342)
R> [1] 0.6727196

Hardly higher than ‘two dozen’ or ‘two colums’ bets in Roulette \(\left(\small{\frac{2\,\times\,12}{1\,+\,36}\sim0.65}\right)\). I hope you don’t want to go there…

previous section ↩︎

Difference of Means

    

Sometimes we are interested in assessing differences of responses and not their ratios. In such a case we have to set logscale = FALSE. The limits theta1 and theta2 can be expressed in the following ways:

  • As a difference of means relative to the same (underlying) reference mean.
  • In units of the difference of means.

Note that in the former case the units of CV, and theta0 need also to be given relative to the reference mean (specified as ratio).

Let’s estimate the sample size for an equivalence trial of two blood pressure lowering drugs assessing the difference in means of untransformed data (raw, linear scale). In this setup everything has to be given with the same units (i.e., here the assumed difference –5 mm Hg and the lower / upper limits –15 mm Hg / +15 mm Hg systolic blood pressure). Furthermore, we assume a CV of 25 mm Hg.

sampleN.TOST(CV = 20, theta0 = -5,
             theta1 = -15, theta2 = +15,
             logscale = FALSE, design = "parallel")
R> 
R> +++++++++++ Equivalence test - TOST +++++++++++
R>             Sample size estimation
R> -----------------------------------------------
R> Study design: 2 parallel groups 
R> untransformed data (additive model)
R> 
R> alpha = 0.05, target power = 0.8
R> BE margins = -15 ... 15 
R> True diff. = -5,  CV = 20
R> 
R> Sample size (total)
R>  n     power
R> 102   0.805523

Sometimes in the literature we find not the CV but the standard deviation of the difference. Say, it is given with 36 mm Hg. We have to convert it to a CV. In a parallel design it’s simply half of the SD.

SD.delta <- 36
sampleN.TOST(CV = SD.delta / 2, theta0 = -5,
             theta1 = -15, theta2 = +15,
             logscale = FALSE, design = "parallel")
R> 
R> +++++++++++ Equivalence test - TOST +++++++++++
R>             Sample size estimation
R> -----------------------------------------------
R> Study design: 2 parallel groups 
R> untransformed data (additive model)
R> 
R> alpha = 0.05, target power = 0.8
R> BE margins = -15 ... 15 
R> True diff. = -5,  CV = 18
R> 
R> Sample size (total)
R>  n     power
R> 82   0.801691

Note that other software packages (e.g., PASS, nQuery, StudySize, …) require the standard deviation of the difference as input.

previous section ↩︎

Ratio of Means

    

It would be a fundamental flaw if responses are assumed to a follow a normal distribution and not – like above – assess their difference and at the end instead giving an estimate of the difference, calculate the ratio of the test- and reference-means.

In such a case Fieller’s (‘fiducial’) confidence interval20 has to be employed.
Note that in the functions sampleN.RatioF()and power.RatioF() alpha = 0.025 is the default, since it is intended for studies with clinical endpoints, where the 95% confidence interval is usually used for equivalence testing.21

sampleN.RatioF(CV = 0.40, theta0 = 0.95,
               design = "parallel")
R> 
R> +++++++++++ Equivalence test - TOST +++++++++++
R>     based on Fieller's confidence interval
R>             Sample size estimation
R> -----------------------------------------------
R> Study design: 2-group parallel
R> Ratio of means with normality on original scale
R> alpha = 0.025, target power = 0.8
R> BE margins = 0.8 ... 1.25 
R> True ratio = 0.95,  CV = 0.4
R> 
R> Sample size
R>  n     power
R> 188   0.801068

previous section ↩︎

Methods

He who seeks for methods without having a definite problem in mind seeks in the most part in vain.
    

With a few exceptions (i.e., simulation-based methods), in PowerTOST the default method = "exact" implements Owen’s Q function22 which is also used in SAS’ Proc Power.

Proc Power;
  twosamplemeans
    test      = equiv_ratio
    power     = .
    alpha     = 0.05
    meanratio = 1.0
    npergroup = 450
    cv        = 1.5
    lower     = 0.80
    upper     = 1.25
  ;
run;

If you run the macro in SAS (at least v9.1 is required), you should get 0.84896.23

Other implemented methods are "mvt" (based on the bivariate non-central t-distribution), "noncentral" / "nct" (noncentral t-distribution), and "shifted" / "central" (shifted central t-distribution). Although "mvt" is also exact, it may have a somewhat lower precision compared to Owen’s Q and has a much longer run-time.

Let’s compare them.

methods <- c("Proc Power",
             "exact", "mvt", "noncentral", "central")
df      <- data.frame(method = methods,
                      power = c(0.84896, rep(NA, 4)))
for (i in 2:nrow(df)) {
  df$power[i] <- power.TOST(design   = "parallel",
                            logscale = TRUE,
                            alpha    = 0.05,
                            theta0   = 1.0,
                            n        = c(450, 450),
                            CV       = 1.5,
                            theta1   = 0.80,
                            theta2   = 1.25,
                            method   = df$method[i])
}
df$power <- round(df$power, 5)
print(df, row.names = FALSE)
R>      method   power
R>  Proc Power 0.84896
R>       exact 0.84896
R>         mvt 0.84897
R>  noncentral 0.84896
R>     central 0.84879

Power approximated by the shifted central t-distribution is generally slightly lower compared to the others. Hence, if used in sample size estimations, occasionally two more subjects are ‘required’ – which is not correct as demonstrated by the exact method.

methods <- c("exact", "mvt", "noncentral", "central")
df      <- data.frame(method = methods)
for (i in 1:nrow(df)) {
  tmp <- sampleN.TOST(CV = 0.40, theta0 = 0.90,
                      design = "parallel",
                      method = df$method[i],
                      targetpower = 0.80, print = FALSE)
  df$n[i]           <- tmp[["Sample size"]]
  df$power[i]       <- tmp[["Achieved power"]]
  df$power.exact[i] <- power.TOST(CV = 0.40,
                                  theta0 = 0.90,
                                  design = "parallel",
                                  n = df$n[i],
                                  method = "exact")
}
print(df, row.names = FALSE)
R>      method   n     power power.exact
R>       exact 266 0.8000762   0.8000762
R>         mvt 266 0.8000775   0.8000762
R>  noncentral 266 0.8000762   0.8000762
R>     central 268 0.8024811   0.8026911

Particularly for ‘nice’ combinations of the CV and T/R-ratio the relative increase in sample sizes (and hence, study costs) might be relevant.

set.seed(123456)
nsims  <- 1e4
CV     <- signif(runif(nsims, min = 0.25, max = 0.60), 2)
theta0 <- signif(runif(nsims, min = 0.85, max = 0.95), 2)
df     <- data.frame(CV = CV, theta0 = theta0,
                     exact = NA, shifted = NA,
                     diff = NA, delta.n = FALSE)
df   <- unique(df)
for (i in 1:nrow(df)) {
  df$exact[i]   <- sampleN.TOST(CV = df$CV[i],
                                theta0 = df$theta0[i],
                                design = "parallel",
                                method = "exact",
                                print = FALSE)[["Sample size"]]
  df$shifted[i] <- sampleN.TOST(CV = df$CV[i],
                                theta0 = df$theta0[i],
                                design = "parallel",
                                method = "shifted",
                                print = FALSE)[["Sample size"]]
  if (df$shifted[i] != df$exact[i]) df$delta.n[i] <- TRUE
}
df$diff <- 100 * (df$shifted - df$exact) / df$exact
df      <- df[with(df, order(-diff, theta0, CV)), ]
df$diff <- sprintf("%+.3f%%", df$diff)
print(df[which(df$delta.n == TRUE), 1:5], row.names = FALSE)
R>    CV theta0 exact shifted    diff
R>  0.27   0.95    62      64 +3.226%
R>  0.28   0.94    74      76 +2.703%
R>  0.28   0.93    84      86 +2.381%
R>  0.32   0.93   108     110 +1.852%
R>  0.28   0.91   114     116 +1.754%
R>  0.26   0.90   118     120 +1.695%
R>  0.28   0.90   136     138 +1.471%
R>  0.34   0.92   140     142 +1.429%
R>  0.45   0.95   160     162 +1.250%
R>  0.37   0.92   164     166 +1.220%
R>  0.51   0.95   200     202 +1.000%
R>  0.53   0.95   214     216 +0.935%
R>  0.48   0.93   228     230 +0.877%
R>  0.30   0.88   236     238 +0.847%
R>  0.51   0.93   254     256 +0.787%
R>  0.59   0.95   258     260 +0.775%
R>  0.48   0.92   264     266 +0.758%
R>  0.40   0.90   266     268 +0.752%
R>  0.53   0.93   272     274 +0.735%
R>  0.49   0.92   274     276 +0.730%
R>  0.50   0.92   284     286 +0.704%
R>  0.35   0.88   316     318 +0.633%
R>  0.49   0.91   322     324 +0.621%
R>  0.31   0.87   324     326 +0.617%
R>  0.50   0.91   334     336 +0.599%
R>  0.52   0.91   358     360 +0.559%
R>  0.34   0.87   386     388 +0.518%
R>  0.46   0.88   524     526 +0.382%
R>  0.46   0.87   676     678 +0.296%
R>  0.54   0.88   698     700 +0.287%
R>  0.56   0.88   744     746 +0.269%
R>  0.51   0.87   814     816 +0.246%
R>  0.53   0.86  1172    1174 +0.171%

Therefore, I recommend to use "method = shifted" / "method = central" only for comparing with old results (literature, own studies).

previous section ↩︎

Q & A

  • Q: Can we use R in a regulated environment and is PowerTOST validated?
    A: About the acceptability of Base R see ‘A Guidance Document for the Use of R in Regulated Clinical Trial Environments’.

    The authors of PowerTOST tried to do their best to provide reliable and valid results. The ‘NEWS’-file on CRAN documents the development of the package, bug-fixes, and introduction of new methods.
    Validation of any software (yes, of SAS as well…) lies in the hands of the user. The package contains sample data.frames from the literature24 (show them by ctSJ.VIII.10 and ctSJ.VIII.20). Execute the script test_parallel.R which can be found in the /tests sub-directory of the package to reproduce them. If results don’t agree, something went wrong. Do you use a version <0.9.3 of 2012-02-13? If yes, update to the current one.

    You can also reproduce Table 7.6 of Julious (2010, p 119–20).25

txt <- paste("True Mean Ratios 0.80\u20131.20",
             "\n90% power by the noncentral t-distribution",
             "\n  Levels of Bioequivalence",
             "\n    10% -> 90.00\u2013111.11%",
             "\n    15% -> 85.00\u2013117.65%",
             "\n    20% -> 80.00\u2013125.00%",
             "\n    25% -> 75.00\u2013133.33%",
             "\n    30% -> 70.00\u2013142.86%",
             "\nSample sizes (per arm)\n")
tryCatch.W.E <- function(expr) {
  ##' Catch *and* save both errors and warnings,
  ##' and in the case of a warning, also keep the
  ##' computed result.
  ##'
  ##' @title tryCatch both warnings (with value) and
  ##' errors
  ##' @param expr an \R expression to evaluate
  ##' @return a list with 'value' and 'warning', where
  ##'   'value' may be an error caught.
  ##' @author Martin Maechler;
  ##' Copyright (C) 2010-2012  The R Core Team
  W <- NULL
  w.handler <- function(w) { # warning handler
    W <<- w
    invokeRestart("muffleWarning")
  }
  list(value = withCallingHandlers(
         tryCatch(expr, error = function(e) e),
                  warning = w.handler),
         warning = W)
}
fun <- function(CV, theta0, theta1) {
  return(
    sampleN.TOST(CV = CV, theta0 = theta0,
                 theta1 = theta1,
                 targetpower = 0.90,
                 design = "parallel",
                 method = "noncentral",
                 print = FALSE)[["Sample size"]])
}
CV     <- seq(0.30, 0.65, 0.05)
ratio  <- seq(0.80, 1.20, 0.05)
theta1 <- seq(0.90, 0.70, -0.05)
df     <- data.frame(CV = rep(CV, each = length(ratio)),
                     Ratio = ratio, n10 = NA, n15 = NA,
                     n20 = NA, n25 = NA, n30 = NA)
for (i in 1:nrow(df)) {
  for (j in 3:ncol(df)) {
    # needed to continue if ratio at or outside limits
    # (otherwise, stops with error)
    x <- tryCatch.W.E(fun(CV = df$CV[i],
                          theta0 = df$Ratio[i],
                          theta1 = theta1[j-2]))
    if (is.numeric(x$value)) df[i, j] <- x$value
  }
}
df[, 3:7]     <- df[, 3:7] / 2
df[df > 3623] <- NA
for (j in 3:ncol(df)) {
  df[, j] <- formatC(df[, j], format = "f",
                     digits = 0, big.mark=",")
}
df[(df == "NA")] <- ""
df$CV            <- df$CV*100
names(df)[c(1, 3:7)] <- c("CV (%)",
                          paste0(seq(10, 30, 5), "%"))
cat(txt); print(df, row.names = FALSE)
R> True Mean Ratios 0.80–1.20 
R> 90% power by the noncentral t-distribution 
R>   Levels of Bioequivalence 
R>     10% -> 90.00–111.11% 
R>     15% -> 85.00–117.65% 
R>     20% -> 80.00–125.00% 
R>     25% -> 75.00–133.33% 
R>     30% -> 70.00–142.86% 
R> Sample sizes (per arm)
R>  CV (%) Ratio   10%   15%   20%   25% 30%
R>      30  0.80                     356  84
R>      30  0.85               403    95  40
R>      30  0.90         453   108    46  25
R>      30  0.95   506   121    51    28  18
R>      30  1.00   169    72    39    24  16
R>      30  1.05   462   115    50    28  17
R>      30  1.10         328    92    41  23
R>      30  1.15       2,851   213    69  33
R>      30  1.20               887   134  50
R>      35  0.80                     476 112
R>      35  0.85               540   128  54
R>      35  0.90         607   144    61  33
R>      35  0.95   678   161    69    37  23
R>      35  1.00   226    96    51    31  21
R>      35  1.05   620   154    67    37  23
R>      35  1.10         439   122    55  30
R>      35  1.15               286    92  43
R>      35  1.20             1,189   179  66
R>      40  0.80                     611 144
R>      40  0.85               693   163  69
R>      40  0.90         779   184    78  41
R>      40  0.95   871   207    88    48  30
R>      40  1.00   291   123    66    40  26
R>      40  1.05   796   198    85    47  29
R>      40  1.10         564   157    70  38
R>      40  1.15               367   117  55
R>      40  1.20             1,527   230  85
R>      45  0.80                     759 178
R>      45  0.85               861   203  85
R>      45  0.90         968   229    96  51
R>      45  0.95 1,082   257   109    59  36
R>      45  1.00   361   152    81    49  33
R>      45  1.05   988   245   106    58  36
R>      45  1.10         700   194    87  47
R>      45  1.15               455   146  68
R>      45  1.20             1,896   286 105
R>      50  0.80                     919 216
R>      50  0.85             1,041   245 103
R>      50  0.90       1,171   277   116  62
R>      50  0.95 1,309   310   131    71  44
R>      50  1.00   436   184    98    60  39
R>      50  1.05 1,195   297   128    70  43
R>      50  1.10         847   235   104  57
R>      50  1.15               551   176  82
R>      50  1.20             2,295   345 127
R>      55  0.80                   1,088 255
R>      55  0.85             1,233   290 121
R>      55  0.90       1,387   327   137  73
R>      55  0.95 1,550   367   155    84  52
R>      55  1.00   516   218   116    70  46
R>      55  1.05 1,416   351   151    82  51
R>      55  1.10       1,003   278   124  68
R>      55  1.15               652   208  97
R>      55  1.20             2,718   409 150
R>      60  0.80                   1,266 297
R>      60  0.85             1,434   337 141
R>      60  0.90       1,613   381   160  85
R>      60  0.95 1,803   427   180    97  60
R>      60  1.00   601   253   135    82  54
R>      60  1.05 1,647   408   176    96  59
R>      60  1.10       1,167   323   144  78
R>      60  1.15               759   242 113
R>      60  1.20             3,162   476 174
R>      65  0.80                   1,450 340
R>      65  0.85             1,643   386 161
R>      65  0.90       1,849   436   183  97
R>      65  0.95 2,066   489   207   111  68
R>      65  1.00   688   290   154    93  61
R>      65  1.05 1,887   468   201   109  68
R>      65  1.10       1,337   371   164  90
R>      65  1.15               869   277 129
R>      65  1.20             3,623   545 200
  • Q: Shall we throw away our sample size tables?
    A: Not at all. File them in your archives to collect dust. Maybe in the future you will be asked by an agency how you arrived at a sample size. But: Don’t use them any more. What you should not do (and hopefully haven’t done before): Interpolate. Power and therefore, the sample size depends in a highly nonlinear fashion on the five conditions listed above, which makes interpolation of values given in table a nontrivial job.

  • Q: Which of the methods should we use in our daily practice?
    A: method = exact. Full stop. Why rely on approximations? Since it is the default in sampleN.TOST() and power.TOST(), you don’t have to give this argument (saves keystrokes).

  • Q: I fail to understand your example about dropouts. We finish the study with 130 eligible subjects as desired. Why is the dropout-rate ~11% and not the anticipated 10%?
    A: That’s due to rounding up the calculated adjusted sample size (144.4…) to the next even number (146).
    If you manage it to dose fractional subjects (I can’t) your dropout rate would indeed equal the anticipated one: 100(1 – 130/144.4…) = 10%.

  • Q: Do we have to worry about unequal group sizes?
    A: sampleN.TOST() will always give the total number of subjects for equally sized groups.
    If you are interested in post hoc power, give the sample size as a vector, i.e., power.TOST(..., n = c(foo, bar), where foo and bar are the sizes of groups.
    For the evaluation of a study with unequal group sizes see the next Q&A (though that’s not related to PowerTOST).

  • Q: Do we have to worry about unequal variances (heteroscedasticty)?
    A: Not in planning the study. Esp. in parallel designs you will always try to keep the characteristics of groups (in- / exclusion criteria, demographics, metabolic status,…) as similar as possible. It would be a bad idea to have ballerinas in one group and Sumo-wrestlers in the other. However, at the end of the study characteristics might differ indeed. The conventional t-test is sensitive to both unequal group sizes and – to a lesser extent – unequal variances.

    What you should not do in the evaluation: Run any pretest (e.g., an F-test) because it will inflate the Type I Error. Use the Welch-Satterthwaite test26 instead, which – not by any chance – is the default in R, SAS, and other software packages.
    Since you read that far, I’m sure that you don’t use Excel. Even in its current version T.TEST(,,,3) (where 3 specifies heteroscedasticity) rounds the degrees of freedom down to the next integer number – which is wrong – they are real numbers! Hence, the confidence interval will be wider than necessary.

  • Q: Is it possible to simulate power of studies?
    A: That’s not necessary, since the available methods provide analytical solutions. However, if you don’t trust them, simulations are possible with the function power.TOST.sim(), which employs the distributional properties:

\(\small{\sigma^2}\) follows a \(\small{\chi^2}\)-distribution with \(\small{n-2}\) degrees of freedom and \(\small{\log_{e}(\theta_0)}\) follows a normal distribution).27

Convergence takes a while. Empiric power, its standard error, its relative error compared to the exact power, and execution times on a Xeon E3-1245v3 3.4 GHz, 8 MB cache, 16 GB RAM, 64 bit R 4.0.5 on Windows 7:

CV    <- 0.40
des   <- "parallel"
n     <- sampleN.TOST(CV = CV, design = des,
                      print = FALSE)[["Sample size"]]
nsims <- c(1e5, 5e5, 1e6, 5e6, 1e7, 5e7, 1e8)
exact <- signif(power.TOST(CV =CV, n = n,
                           design = des), 5)
df    <- data.frame(simulations = nsims,
                    exact = rep(exact, length(nsims)),
                    simulated = NA, SE = NA, RE = NA)
for (i in 1:nrow(df)) {
  start.time      <- proc.time()[[3]]
  df$simulated[i] <- power.TOST.sim(CV = CV, n = n,
                                    design = design,
                                    nsims = nsims[i])
  df$secs[i]      <- proc.time()[[3]] - start.time
  df$SE[i]        <- sqrt(0.025 / i)
  df$RE           <- 100 * (df$simulated - exact) / exact
}
df$exact       <- signif(df$exact, 5)
df$SE          <- signif(df$RE, 4)
df$RE          <- sprintf("%+.5f%%", df$RE)
df$simulated   <- signif(df$simulated, 5)
df$simulations <- formatC(df$simulations, format = "f",
                          digits = 0, big.mark=",")
names(df)[c(1, 3)] <- c("sim\u2019s", "sim\u2019d")
print(df, row.names = FALSE)
R>        sim’s   exact   sim’d         SE        RE   secs
R>      100,000 0.80351 0.80286 -0.0809000 -0.08090%   0.13
R>      500,000 0.80351 0.80263 -0.1095000 -0.10952%   0.49
R>    1,000,000 0.80351 0.80318 -0.0405700 -0.04057%   1.00
R>    5,000,000 0.80351 0.80369  0.0224300 +0.02243%   4.96
R>   10,000,000 0.80351 0.80353  0.0019170 +0.00192%   9.99
R>   50,000,000 0.80351 0.80351 -0.0005028 -0.00050%  50.57
R>  100,000,000 0.80351 0.80351  0.0004368 +0.00044% 100.02

Fig. 2 Empiric power for n = 130.

  • Q: I still have questions. How to proceed?
    A: The preferred method is to register at the BEBA Forum and post your question there (please read its Policy first).
    You can contact me at [email protected]. Be warned – I will charge you for anything beyond most basic questions.

previous section ↩︎

Acknowledgment

David Dubins (Leslie Dan Faculty of Pharmacy, University of Toronto) for helpful comments.

License

CC BY 4.0 Helmut Schütz 2021
1st version February 26, 2021.
Rendered 2021-04-16 16:10:05 CEST by rmarkdown in 2.53 seconds.

Footnotes and References


  1. Labes D, Schütz H, Lang B. PowerTOST: Power and Sample Size for (Bio)Equivalence Studies. 2021-01-18. CRAN.↩︎

  2. Schütz H. Average Bioequivalence. 2021-01-18. CRAN.↩︎

  3. Labes D, Schütz H, Lang B. Package ‘PowerTOST’. January 18, 2021. CRAN.↩︎

  4. If we plan all studies for 80% power, treatments are equivalent, and all assumptions are exactly realized, one out of five studies will fail by pure chance.
    Science is a cruel mistress.↩︎

  5. U.S. FDA, CDER. Guidance for Industry. Statistical Approaches Establishing Bioequivalence. January 2001. APPENDIX C.↩︎

  6. According to the WHO: »The a posteriori power of the study does not need to be calculated. The power of interest is that calculated before the study is conducted to ensure that the adequate sample size has been selected. […] The relevant power is the power to show equivalence within the pre-defined acceptance range.«↩︎

  7. Fuglsang A. Pilot and Repeat Trials as Development Tools Associated with Demonstration of Bioequivalence. AAPS J. 2015; 17(3): 678–83. doi:10.1208/s12248-015-9744-6.↩︎

  8. Hoenig JM, Heisey DM. The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. Am Stat. 2001; 55(1): 19–24. doi:10.1198/000313001300339897.↩︎

  9. Even if we have both variabilities from a crossover study, we have to pool them for the sample size estimation of a parallel design.↩︎

  10. Julious SA. Sample Sizes for Clinical Trials. Chapter 7.3. Boca Raton: CRC Press; 2010.↩︎

  11. Common batch release specifications are ±10% of the declared content. Of course, you will try to find a batch of the reference product which is as close as possible to the test (the EMA recommends ≤5%). However, keep analytical (in)accuracy and (im)precision in mind. Furthermore, the analytical method was only validated for the test. Try to get a CoA from the originator… Good luck. You never can be sure. Only Health Canada allows a dose-adjustment in the evaluation of the study.↩︎

  12. Zhang P. A Simple Formula for Sample Size Calculation in Equivalence Studies. J Biopharm Stat. 2003; 13(3): 529–538. doi:10.1081/BIP-120022772.↩︎

  13. Schütz H. Sample Size Estimation in Bioequivalence. Evaluation. 2020-10-23. BEBA Forum.↩︎

  14. Quoting my late father: »If you believe, go to church.«↩︎

  15. Ring A, Lang B, Kazaroho C, Labes D, Schall R, Schütz H. Sample size determination in bioequivalence studies using statistical assurance. Br J Clin Pharmacol. 2019; 85(10): 2369–77. doi:10.1111/bcp.14055↩︎

  16. Berger RL, Hsu JC. Bioequivalence Trials, Intersection-Union Tests and Equivalence Confidence Sets. Stat Sci. 1996; 11(4): 283–302. JSTOR:2246021.↩︎

  17. Zeng A. The TOST confidence intervals and the coverage probabilities with R simulation. March 14, 2014.↩︎

  18. With alpha = 0.5 the test effectively reduces to a pass/fail assessment.↩︎

  19. The FDA recommends Reference-Scaled Average Bioequivalence (RSABE) and a comparison of the within-subject variances of treatments, which requires a fully replicated design. It is unclear what one can do if only a parallel design is feasible (drugs with a very long half-life, studies in patients).↩︎

  20. Fieller EC. Some Problems In Interval Estimation. J Royal Stat Soc B. 1954; 16(2): 175–85. JSTOR:2984043.↩︎

  21. EMEA, CPMP. Points to Consider on Switching between Superiority and Non-Inferiority. London, 27 July 2000. CPMP/EWP/482/99.↩︎

  22. Owen DB. A special case of a bivariate non-central t-distribution. Biometrika. 1965; 52(3/4): 437–46. doi:10.2307/2333696.↩︎

  23. Labes D. Power parallel group: PowerTOST/SAS/PASS. 2015-08-14. BEBA Forum.↩︎

  24. Julious SA. Tutorial in Biostatistics. Sample sizes for clinical trials with Normal data. Stat. Med. 2004; 23(12): 1921–86. doi:10.1002/sim.1783.↩︎

  25. Note that Table 7.6 gives sample sizes for one arm and therefore, we have to divide the total sample sizes obtained by sampleN.TOST() by two. Julious did not report extreme sample sizes (>3,623) and therefore, I dropped them from the output as well.↩︎

  26. If groups sizes and variances are equal, it will reduce to the common t-test.↩︎

  27. Zheng C, Wang J, Zhao L. Testing bioequivalence for multiple formulations with power and sample size calculations. Pharm Stat. 2012; 11(4): 334–41. doi:10.1002/pst.1522.↩︎