Consider allowing JavaScript. Otherwise, you have to be proficient in reading LaTeX since formulas will not be rendered. Furthermore, the table of contents in the left column for navigation will not be available and code-folding not supported. Sorry for the inconvenience.


Examples were generated with R 4.3.1 by the packages PowerTOST1 and randomizeBE.2 More examples are given in the respective vignette.3 See also the README on GitHub for an overview and the Online manual4 for details.

  • The right-hand badges give the respective section’s ‘level’.
    
  1. Basics about sample size methodology – requiring no or only limited statistical expertise.
    
  1. These sections are the most important ones. They are – hopefully – easily comprehensible even for novices.
    
  1. A somewhat higher knowledge of statistics and/or R is required. May be skipped or reserved for a later reading.
    
  1. An advanced knowledge of statistics and/or R is required. Not recommended for beginners in particular.
    
  1. If you are not a neRd or statistics afficionado, skipping is recommended. Suggested for experts but might be confusing for others.
  • Click to show / hide R code.
  • Click on the icon copy icon in the top left corner to copy R code to the clipboard.
Abbreviation Meaning
\(\small{\alpha}\) Nominal level of the test, probability of Type I Error, patient’ risk
\(\small{\beta}\) Probability of Type II Error, producer’s risk (\(\small{1-\pi}\))
BE Bioequivalence
BIBD Balanced Incomplete Block Design
\(\small{CV}\) Coefficient of Variation
\(\small{CV_\textrm{inter}}\) Between-subject \(\small{CV}\)
\(\small{CV_\textrm{intra}}\) Within-subject \(\small{CV}\) (also ISCV)
\(\small{H_0}\) Null hypothesis
\(\small{H_1}\) Alternative hypothesis (also \(\small{H_\textrm{a}}\))
IBD Incomplete Block Design
NTID Narrow Therapeutic Index Drug
\(\small{\pi}\) (Prospective) power
TOST Two One-Sided Tests
    

Introduction

What are the main statistical issues in planning a confirmatory experiment?

For details about inferential statistics and hypotheses in equivalence see another article.

An ‘optimal’ study design is one which (taking all assumptions into account) has a reasonably high chance of demonstrating equivalence (power) whilst controlling the consumer risk.

top of section ↩︎

    

Preliminaries

A basic knowledge of R is required. To run the scripts at least version 1.4.3 (2016-11-01) of PowerTOST and at least version 0.3.3 (2017-03-22) of randomizeBE is suggested. Any version of R would likely do. The current release of PowerTOST was only tested with R 4.1.3 (2022-03-10) and later; the current release of randomizeBE was only tested with R 4.1.1 (2021-08-10)) and later. Scripts were run on a Xeon E3-1245v3 @ 3.40GHz (1/4 cores) 16GB RAM with R 4.3.1 on Windows 7 build 7601, Service Pack 1, Universal C Run­time 10.0.10240.16390.

Note that in all functions of PowerTOST – and scripts using them – the arguments (say, the assumed T/R-ratio theta0, the BE-limits (theta1, theta2), the assumed coefficient of variation CV, the desired (target) power targetpower, the anticipated dropout-rate do.rate, etc.) have to be given as ratios and not in percent.

sampleN.TOST() gives balanced sequences (i.e., the same number of subjects is allocated to each of the sequences). Furthermore, the estimated sample size is the total number of subjects (not subjects per sequence – like in some other software packages).

Higher-Order Designs with three and four treatments are supported.

design internal name df common name
"3x3" 3x3 crossover 2n-4 3×3 Latin Square
"3x6x3" 3x6x3 crossover 2n-4 3×6×3 Williams’ design
"4x4" 4x4 crossover 3n-6 4×4 Latin Square or a Williams’ design

"3x3" denotes the Latin Square (ABC|BCA|CAB).
"3x6x3" denotes the 6-sequence Williams’ design (ABC|ACB|BAC|BCA|CAB|CBA).
"4x4" denotes the Latin Square (ABCD|BCDA|CDAB|DABC) or any of the six possible Williams’ designs with four sequences:

library(randomizeBE)
williams4        <- data.frame(option = NA_integer_,
                               seq.1 = NA_character_, seq.2 = NA_character_,
                               seq.3 = NA_character_, seq.4 = NA_character_)
for (i in 1:50) {
  williams4[i, 2:5] <- williams(4)
}
williams4        <- unique(williams4)
options          <- 1:nrow(williams4)
williams4$option <- 1:nrow(williams4)
print(williams4, row.names = FALSE)
#  option seq.1 seq.2 seq.3 seq.4
#       1  ABDC  BCAD  CDBA  DACB
#       2  ADCB  BCDA  CABD  DBAC
#       3  ACDB  BDCA  CBAD  DABC
#       4  ABCD  BDAC  CADB  DCBA
#       5  ACBD  BADC  CDAB  DBCA
#       6  ADBC  BACD  CBDA  DCAB

df are the degrees of freedom, where n is the sample size.

Most examples deal with studies where the response variables likely follow a log-normal distribution, i.e., we assume a multiplicative model (ratios instead of differences). We work with \(\small{\log_{e}}\)-transformed data in order to allow analysis by the t-test (requiring differences). This is the default in most functions of PowerTOST and hence, the argument logscale = TRUE does not need to be specified.

Higher-Order Crossover Designs are particular because two approaches for the evaluation are possible. The sample size depends on the chosen approach.

Suppose we have a bioequivalence study with three treatments – A, B, and C – and the objective of the study is to make pairwise comparisons among the treatments. Suppose further that treatment C is different in kind from A and B, so that the assumption of homogeneous variance among the three treatments is questionable. One way to do the analyses, under normality assumptions, is Two at a Time – e.g., to test hypotheses about A and B, use only the data from A and B. Another way is All at Once – include the data from all three treatments in a single analysis, making pairwise comparisons within this analysis. If the assumption of homogeneous variance is correct, the All at Once approach will provide more d.f. for estimating the common variance, resulting in increased power. If the variance of C differs from that of A and B, the All at Once approach may have reduced power or an inflated type I error rate, depending on the direction of the difference in variances.
Donald J. Schuirmann (2004)5

Which approach you could / should use is discussed further down in the sections

Two at a Time

All at Once

previous section ↩︎

    

Terminology

It may sound picky but ‘sample size calculation’ (as used in most guidelines and alas, in some publications and textbooks) is sloppy terminology. In order to get prospective power (and hence, a sample size), we need five values:

  1. The level of the test \(\small{\alpha}\) (in BE commonly 0.05),
  2. the BE-margins (commonly 0.80 – 1.25),
  3. the desired (or target) power \(\small{\pi}\),
  4. the variance (commonly expressed as a coefficient of variation), and
  5. the deviation of the test from the reference treatment.

1 – 2 are fixed by the agency,
3 is set by the sponsor (commonly to 0.80 – 0.90), and
4 – 5 are just (uncertain!) assumptions.

In other words, obtaining a sample size is not an exact calculation like \(\small{2\times2=4}\) but always just an estimation.

Power Calculation – A guess masquerading as mathematics.
Stephen Senn (2020)6
Realization: Ob­ser­vations (in a sample) of a random variable (of the population).

As an aside, it is extremely unlikely that all assumptions will be exactly realized in a particular study. Hence, calculating retrospective (a.k.a. post hoc, a posteriori) power is not only futile but plain nonsense.7

Since generally the within-subject variability is lower than the between-subject variability, crossover studies are so popular. The efficiency of a crossover study compared to a parallel study is given by \(\small{\frac{\sigma_\textrm{intra}^2\;+\,\sigma_\textrm{inter}^2}{0.5\,\times\,\sigma_\textrm{intra}^2}}\). If, say, \(\small{\sigma_\textrm{intra}^2=0.5\times\sigma_\textrm{inter}^2}\) in a paralled study we need six times as many subjects than in a crossover to obtain the same power. On the other hand, in a crossover we have two measurements per subject, which makes the parallel study approximately three times more costly.

Of note, there is no relationship between \(\small{CV_\textrm{intra}}\) and \(\small{CV_\textrm{inter}}\).
An example are drugs which are subjected to polymorphic metabolism. For these drugs \(\small{CV_\textrm{intra}\ll CV_\textrm{inter}}\). On the other hand, some HVD(P)s show \(\small{CV_\textrm{intra}>CV_\textrm{inter}}\).

Carryover: A resi­dual effect of a previous period.

However, it is a prerequisite that no carryover from one period to the next exists. Only then the comparison of treatments will be unbiased. For details see another article.8

Subjects have to be in the same physiological state9 throughout the study – guaranteed by sufficiently long wash­out phases. Crossover studies can not only be performed in healthy volunteers but also in patients with a stable disease (e.g., asthma). Studies in patients with an instable disease (e.g., in oncology) must be performed in a pa­ral­lel design (covered in another article).
If crossovers are not feasible (e.g., for drugs with a very long half life), studies could be performed in a parallel design as well.

previous section ↩︎

Power → Sample size

    

The sample size cannot be directly estimated,
only power calculated for an already given sample size.

The power equations cannot be re-arranged to solve for sample size.

Power. That which statisticians are always calculating but never have.
Stephen Senn (2007)10
    

If you are a geek out in the wild and have only your scientific calculator with you, use the large sample approximation (i.e., based on the normal distribution of \(\small{\log_{e}}\)-transformed data) taking power and \(\small{\alpha}\) into account.11

Convert the coefficient of variation to the variance \[\begin{equation}\tag{1} \sigma^2=\log_{e}(CV^2+1) \end{equation}\] Numbers needed in the following: \(\small{Z_{1-0.2}=0.8416212}\), \(\small{Z_{1-0.1}=1.281552}\), \(\small{Z_{1-0.05}=1.644854}\). \[\begin{equation}\tag{2a} \textrm{if}\;\theta_0<1:n\sim\frac{2\,\sigma^2(Z_{1-\beta}+Z_{1-\alpha})^2}{\left(\log_{e}(\theta_0)-\log_{e}(\theta_1)\right)^2} \end{equation}\] \[\begin{equation}\tag{2b} \textrm{if}\;\theta_0>1:n\sim\frac{2\,\sigma^2(Z_{1-\beta}+Z_{1-\alpha})^2}{\left(\log_{e}(\theta_0)-\log_{e}(\theta_2)\right)^2} \end{equation}\] \[\begin{equation}\tag{2c} \textrm{if}\;\theta_0=1:n\sim\frac{2\,\sigma^2(Z_{1-\beta/2}+Z_{1-\alpha})^2}{\left(\log_{e}(\theta_2)\right)^2} \end{equation}\] Generally you should use \(\small{(2\textrm{a},\,}\)\(\small{2\textrm{b})}\). Only if you are courageous, use \(\small{(2\textrm{c})}\). However, I would not recommend that.12

The sample size by \(\small{(2)}\) can be a real number. We have to round it up to the next multiple integer of the number of sequences to obtain balance.

In \(\small{(1-}\)\(\small{2)}\) we assume that the population variance \(\small{\sigma^2}\) is known, which is never the case. In practice is is uncertain because only its estimate (the sample variance \(\small{s^2}\)) is known. Therefore, \(\small{(1-}\)\(\small{2)}\) may give a sample size which is too low:
For \(\small{CV=0.25}\), \(\small{\theta_0=0.95}\), \(\small{\alpha=0.05}\), \(\small{\beta=0.20,}\) and the conventional limits we get 25.383 for the total sample size (rounded up to 30). As we will see later, the correct sample size is also 30.

Power: That which is wielded by the priesthood of clinical trials, the statisticians, and a stick which they use to beta their colleagues.
Stephen Senn (2004)13
    

Since we evaluate studies based on the t-test, we have to take the degrees of freedom (in the 3-treatment designs \(\small{df=2\,n-4}\) and in the 4-treatment designs \(\small{df=3\,n-6}\)) into account.
Power depends on the degrees of freedom, which themselves depend on the sample size.

    

Hence, the power equations are used to estimate the sample size in an iterative way.

  • Start with an approximate sample size (from (2) or with a ‘guestimate’),
  • calculate its power (see the methods section), and
    • if it is < our target, increase the sample size until ≥ the target power is reached → stop;
    • if it is > our target, decrease the sample size until we drop below the target power; use the last sample size with power ≥ the target → stop.

Example based on the noncentral t-distribution (warning: 114 lines of code).

get.sequences <- function(design) {
  known <- known.designs()[, c(2, 5)]
  ns    <- as.integer(known[known$design == design,
                            "steps"])
  return(ns)
}
get.df <- function(design, n) {
  known <- known.designs()[, 2:3]
  dfe   <- known[known$design == design, "df"]
  df    <- as.integer(eval(parse(text = dfe)))
  return(df)
}
make.equal <- function(n, ns) {# equally sized sequences
  return(as.integer(ns * (n %/% ns + as.logical(n %% ns))))
}
CV         <- 0.25       # within-subject CV
theta0     <- 0.95       # T/R-ratio
theta1     <- 0.80       # lower BE-limit
theta2     <- 1.25       # upper BE-limit
target     <- 0.80       # desired (target) power
alpha      <- 0.05       # level of the test
beta       <- 1 - target # producer's risk
design     <- "3x6x3"    # Williams’
ns         <- get.sequences(design)
df         <- data.frame(method = character(),
                         iteration = integer(),
                         n = integer(),
                         power = numeric())
s2         <- log(CV^2 + 1)
s          <- sqrt(s2)
Z.alpha    <- qnorm(1 - alpha)
Z.beta     <- qnorm(1 - beta)
Z.beta.2   <- qnorm(1 - beta / 2)
# large sample approximation
if (theta0 == 1) {
  num   <- 2 * s2 * (Z.beta.2 + Z.alpha)^2
  n.seq <- ceiling(num / log(theta2)^2)
} else {
  num <- 2 * s2 * (Z.beta + Z.alpha)^2
  if (theta0 < 1) {
    denom <- (log(theta0) - log(theta1))^2
}else {
    denom <- (log(theta0) - log(theta2))^2
  }
  n.seq <- ceiling(num / denom)
}
n     <- make.equal(n.seq, ns)
power <- pnorm(sqrt((log(theta0) - log(theta2))^2 * n /
              (2 * s2)) - Z.alpha) +
         pnorm(sqrt((log(theta0) - log(theta1))^2 * n /
              (2 * s2)) - Z.alpha) - 1
df[1, 1:4] <- c("large sample", "\u2013",
                n, sprintf("%.6f", power))
# check by central t approximation
# (since the variance is unknown)
t.alpha <- qt(1 - alpha, get.df(design, n))
if (theta0 == 1) {
  num   <- 2 * s2 * (Z.beta.2 + t.alpha)^2
  n.seq <- ceiling(num / log(theta2)^2)
} else {
  num <- 2 * s2 * (Z.beta + t.alpha)^2
  if (theta0 < 1) {
    denom <- (log(theta0) - log(theta1))^2
}else {
    denom <- (log(theta0) - log(theta2))^2
  }
  n.seq <- ceiling(num / denom)
}
n     <- make.equal(n.seq, ns)
t.alpha <- qt(1 - alpha, get.df(design, n))
power   <- pnorm(sqrt((log(theta0) - log(theta2))^2 * n /
                (2 * s2)) - t.alpha) +
           pnorm(sqrt((log(theta0) - log(theta1))^2 * n /
                (2 * s2)) - t.alpha) - 1
df[2, 1:4] <- c("central t", "\u2013",
                 n, sprintf("%.6f", power))
i <- 1
if (power < target) {# iterate upwards
  repeat {
    sem   <- sqrt(2 / n) * s
    ncp   <- c((log(theta0) - log(theta1)) / sem,
               (log(theta0) - log(theta2)) / sem)
    power <- diff(pt(c(+1, -1) * qt(1 - alpha,
                                    df = get.df(design, n)),
                     df = get.df(design, n), ncp = ncp))
    df[i + 2, 1:4] <- c("noncentral t", i,
                         n, sprintf("%.6f", power))
    if (power >= target) {
      break
  }else {
      i <- i + 1
      n <- n + 2
    }
  }
} else {             # iterate downwards
  repeat {
    sem   <- sqrt(2 / n) * s
    ncp   <- c((log(theta0) - log(theta1)) / sem,
               (log(theta0) - log(theta2)) / sem)
    power <- diff(pt(c(+1, -1) * qt(1 - alpha,
                                    df = get.df(design, n)),
                     df = get.df(design, n), ncp = ncp))
    df[i + 2, 1:4] <- c("noncentral t", i,
                        n, sprintf("%.6f", power))
    if (power < target) {
      df <- df[-nrow(df), ]
      break
  }else {
      i <- i + 1
      n <- n - 2
    }
  }
}
print(df, row.names = FALSE)
#        method iteration  n    power
#  large sample         – 30 0.851271
#     central t         – 30 0.844551
#  noncentral t         1 30 0.843006
#  noncentral t         2 28 0.817694

    

Now it is time to leave Base R behind and continue with PowerTOST. Makes our life much easier.

library(PowerTOST)   # attach libraries
library(randomizeBE) # to run the examples

The sample size functions of PowerTOST use a modification of Zhang’s method14 based on the large sample approximation as the starting value of the iterations. You can unveil the course of iterations with details = TRUE.

sampleN.TOST(CV = 0.25, theta0 = 0.95, design = "3x6x3",
             targetpower = 0.80, details = TRUE)
# 
# +++++++++++ Equivalence test - TOST +++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 3x6x3 crossover 
# Design characteristics:
# df = 2*n-4, design const. = 2, step = 6
# 
# log-transformed data (multiplicative model)
# 
# alpha = 0.05, target power = 0.8
# BE margins = 0.8 ... 1.25 
# True ratio = 0.95,  CV = 0.25
# 
# Sample size search (ntotal)
#  n     power
# 24   0.753428 
# 30   0.843006 
# 
# Exact power calculation with
# Owen's Q functions.

Two iterations. In many cases it hits the bull’s eye right away, i.e., already with the first guess.

Never (ever!) estimate the sample size based on the large sample approximation \(\small{(2)}\).
We evaluate our studies based on the t-test, right?
The sample size based on the normal distribution may be too small, thus compromising power.

#                     method  n Delta    RE    power
#   large sample approx. (2) 30    ±0 ±0.0% 0.843006
#  noncentral t-distribution 30    ±0 ±0.0% 0.843006
#    exact method (Owen’s Q) 30     –   –   0.843006

In this example it works but it may fail in others.

top of section ↩︎ previous section ↩︎

    

Examples

Throughout the examples I’m referring to studies in a single center – not multiple groups within them or multicenter studies. That’s another pot of tea.

Most methods of PowerTOST are based on pairwise comparisons. It is up to you to adjust the level of the test alpha if you want to compare more (say, two test treatments vs a reference or each of them against one of the others) in order to avoid inflation of the family-wise error rate due to multiplicity. Details are given further down.

    

Planned Evaluation

First of all we have have to consider the type of study, its purpose, and how we intend to evaluate it.

  1. A pilot study with candidates vs a reference.
  2. A pivotal study of a test treatment vs references from different regions,
    a pivotal study of test vs reference in fasting and fed state,
    a pivotal study of ODTs administered with and without water.
  3. A pivotal study with candidates vs a reference.

Let’s discuss the cases from above and whether we can evaluate the study by the common \(\small{\alpha=0.05}\) (i.e., with a \(\small{90\%}\) confidence interval) or have to adjust for multiplicity (\(\small{\alpha<0.05 \to\; >90\%}\) CI).

  1. Actually the CI is not relevant at all and therefore, we can use any – reasonably large – pilot study. We are interested in the T/R-ratios and CVs for selecting the ‘best’ candidate. Say, we have two candidate treatments (\(\small{\textrm{C}_1}\), \(\small{\textrm{C}_2}\)) and one reference (\(\small{\textrm{R}}\)).
    We would select the candidate which is closest to the reference (i.e., \(\small{\min \left \{\left|\log_{e}\theta_{\textrm{C}_1}\right|,\left|\log_{e}\theta_{\textrm{C}_2}\right|\right\}}\)) for the pivotal study because it is the most promising one.
    If the deviations of candidates from the reference are similar (i.e., \(\small{\left|\log_{e}\theta_{\textrm{C}_1}\right|\sim\left|\log_{e}\theta_{\textrm{C}_2}\right|}\)), we would select the candidate with lower within-subject variance.
    No multiplicity correction has to be applied.
  2. In the ‘Two at a Time’ approach we obtain not only point estimates of the pairwise comparisons (like in the ‘All at Once’ approach) but also their within-subject variances (e.g., in the comparisons \(\small{\textrm{T}}\) vs \(\small{\textrm{R}_1}\) and \(\small{\textrm{T}}\) vs \(\small{\textrm{R}_2}\)).15
    Since we aim at demonstrating BE for all products (or in all conditions), no multiplicity correction has to be applied.16 17
  3. Sometimes the pilot was indecisive (very similar T/R ratios and CVs) and companies are wary to select one based on ‘gut feelings’ and include both in the pivotal study.
    Submit the ‘better’ one to the authority and stop developing the other.
    In the past opinion was split amongst statisticians:
    1. Since only one product will be marketed, this approach does not increase the patient’s risk (the \(\small{90\%}\) CI is sufficient). I used this argument for many years, nowadays I’m strongly supporting a multiplicity adjustment.
    2. The sponsor gets more chances to show BE and selects one product based on the outcome. This is data-driven, naturally inflates the Type I Error and hence, a multiplicity adjustment (e.g., Bonferroni’s) has to be performed.

Two at a Time

In this approach – which is preferred by some agencies, e.g., the EMA18 19 20 and the FDA,21 22 as well as recommended by the ICH17 – all treatments except the two of interest are excluded. This means that we obtain separate Incomplete Block Designs23 24 (IBD), where the coding (treatments, sequences, periods) is kept and analyses of the respective pairs of interest are performed.

Williams’ designs with three and four treatments and the resulting IBDs (excluded treatments are denoted by \(\small{{\color{Red}\bullet}\,}\)): \[\small{\begin{array}{c|ccc} s/p & \text{I} & \text{II} & \text{III} \\\hline 1 & \text{A} & \text{B} & \text{C}\\ 2 & \text{A} & \text{C} & \text{B}\\ 3 & \text{B} & \text{A} & \text{C}\\ 4 & \text{B} & \text{C} & \text{A}\\ 5 & \text{C} & \text{A} & \text{B}\\ 6 & \text{C} & \text{B} & \text{A}\\ \end{array}{\color{Blue}\mapsto} \begin{array}{c|ccc} s/p & \text{I} & \text{II} & \text{III} \\\hline 1 & \text{A} & {\color{Red}\bullet} & \text{C}\\ 2 & \text{A} & \text{C} & {\color{Red}\bullet}\\ 3 & {\color{Red}\bullet} & \text{A} & \text{C}\\ 4 & {\color{Red}\bullet} & \text{C} & \text{A}\\ 5 & \text{C} & \text{A} & {\color{Red}\bullet}\\ 6 & \text{C} & {\color{Red}\bullet} & \text{A}\\ \end{array}{\color{Blue}\wedge} \begin{array}{c|ccc} s/p & \text{I} & \text{II} & \text{III} \\\hline 1 & {\color{Red}\bullet} & \text{B} & \text{C}\\ 2 & {\color{Red}\bullet} & \text{C} & \text{B}\\ 3 & \text{B} & {\color{Red}\bullet} & \text{C}\\ 4 & \text{B} & \text{C} & {\color{Red}\bullet}\\ 5 & \text{C} & {\color{Red}\bullet} & \text{B}\\ 6 & \text{C} & \text{B} & {\color{Red}\bullet}\\ \end{array}}\] \[\small{\begin{array}{c|cccc} s/p & \text{I} & \text{II} & \text{III} & \text{IV}\\\hline 1 & \text{A} & \text{C} & \text{B} & \text{D}\\ 2 & \text{B} & \text{A} & \text{D} & \text{C}\\ 3 & \text{C} & \text{D} & \text{A} & \text{B}\\ 4 & \text{D} & \text{B} & \text{C} & \text{A}\\ \end{array}{\color{Blue}\mapsto} \begin{array}{c|cccc} s/p & \text{I} & \text{II} & \text{III} & \text{IV} \\\hline 1 & \text{A} & {\color{Red}\bullet} & {\color{Red}\bullet} & \text{D}\\ 2 & {\color{Red}\bullet} & \text{A} & \text{D} & {\color{Red}\bullet}\\ 3 & {\color{Red}\bullet} & \text{D} & \text{A} & {\color{Red}\bullet}\\ 4 & \text{D} & {\color{Red}\bullet} & {\color{Red}\bullet} & \text{A}\\ \end{array}{\color{Blue}\wedge} \begin{array}{c|cccc} s/p & \text{I} & \text{II} & \text{III} & \text{IV}\\\hline 1 & {\color{Red}\bullet} & {\color{Red}\bullet} & \text{B} & \text{D}\\ 2 & \text{B} & {\color{Red}\bullet} & \text{D} & {\color{Red}\bullet}\\ 3 & {\color{Red}\bullet} & \text{D} & {\color{Red}\bullet} & \text{B}\\ 4 & \text{D} & \text{B} & {\color{Red}\bullet} & {\color{Red}\bullet}\\ \end{array}{\color{Blue}\wedge} \begin{array}{c|cccc} s/p & \text{I} & \text{II} & \text{III} & \text{IV}\\\hline 1 & {\color{Red}\bullet} & \text{C} & {\color{Red}\bullet} & \text{D}\\ 2 & {\color{Red}\bullet} & {\color{Red}\bullet} & \text{D} & \text{C}\\ 3 & \text{C} & \text{D} & {\color{Red}\bullet} & {\color{Red}\bullet}\\ 4 & \text{D} & {\color{Red}\bullet} & \text{C} & {\color{Red}\bullet}\\ \end{array}}\] With Latin Squares (\(\small{\text{ABC}|\text{BCA}|\text{CAB}}\), \(\small{\text{ABCD}|\text{BCDA}|\text{CDAB}|\text{DABC}}\), etc.) balanced IBDs cannot be achieved.

An example to get a randomization list and extract the IBDs. We assume a CV of 0.25, a T/R-ratio of 0.95, target a power of 0.80, and plan the evaluation for the ‘Two at a Time’ approach.

make.equal <- function(n, ns) {# equally sized sequences
  return(as.integer(ns * (n %/% ns + as.logical(n %% ns))))
}
nadj <- function(n, do.rate, ns) {
  return(as.integer(make.equal(n / (1 - do.rate), ns)))
}
ntmt    <- 4     # number of treatments (at least 3)
CV      <- 0.25  # assumed CV
theta0  <- 0.95  # assumed T/R-ratio
theta1  <- 0.80  # lower and
theta2  <- 1.25  # upper BE limit
target  <- 0.80  # target (desired) power
bal     <- FALSE # TRUE if the study should have balanced sequences
do.rate <- 0     # anticipated dropout-rate
# sample size for evaluation by ‘Two at a Time’
# note that the design is specified as "2x2"
n       <- sampleN.TOST(CV = CV, theta0 = theta0, targetpower = target,
                        theta1 = theta1, theta2 = theta2,
                        design = "2x2", print = FALSE)[["Sample size"]]
# sequences of a Williams’ design
seqs    <- williams(ntmt)
ns      <- 2 # because 2x2!
# adjust for dropout-rate
n       <- nadj(n, do.rate, ns)
# eventual uprounding if balance is requested
if (bal) n <- make.equal(n, length(seqs))
txt1    <- paste("Sequences:", paste(seqs, collapse = ", "), "\n")
s       <- paste0(seqs, collapse = "")
# the last treatment in lexical order will act as the reference
s       <- paste(sort(unlist(strsplit(s, ""))), collapse = "")
ref     <- substr(s, nchar(s), nchar(s))
# extract treatment codes
trts    <- unique(c(strsplit(split = "", seqs), recursive = TRUE))
# extract tests in increasing order
tests   <- sort(trts[!trts == ref])
calls   <- 0
repeat {# call randomizations until all IBDs are balanced
  # get a randomization list
  calls <- calls + 1
  tmp   <- suppressWarnings( # not relevant for TaT
             RL4(nsubj = n, seqs = seqs, randctrl = FALSE))
  rand  <- tmp$rl
  # generate columns for the IBDs
  for (j in seq_along(tests)) {
    rand[[paste0("IBD.", j)]] <- NA
  }
  # build the IBDs
  for (j in 1:nrow(rand)) {
    for (k in seq_along(tests)) {
      excl           <- tests[!tests == tests[k]]
      excl           <- paste0("[", paste(excl, collapse = ", "), "]")
      rand[j, 3 + k] <- gsub(excl, "•", rand$sequence[j])
    }
  }
  # check IBDs for balance
  checks <- rep(NA, length(tests))
  for (j in seq_along(tests)) {
    ibd <- gsub("[^[:alnum:], ]", "", rand[3 + j])
    ibd <- gsub("c", "", ibd)
    ibd <- gsub(" ", "", unlist(strsplit(ibd, ",")))
    checks[j] <- length(ibd[ibd == unique(ibd)[1]]) ==
                 length(ibd[ibd == unique(ibd)[2]])
  }
  if (sum(checks) == length(tests)) break
}
ibd  <- names(rand)[4:ncol(rand)]
names(rand)[4:ncol(rand)] <- gsub("\\.", " ", ibd)
txt2 <- paste0("All extracted IBDs are balanced.",
               "\nRandomized                      : ",
               format(tmp$date, usetz = TRUE),
               "\nSeeed                           : ", tmp$seed,
               "\nCalls until balance was achieved: ", calls, "\n")

cat(txt1)
print(rand, row.names = FALSE)
cat(txt2)
# Sequences: ACBD, BADC, CDAB, DBCA 
#  subject seqno sequence IBD 1 IBD 2 IBD 3
#        1     1     ACBD  A••D  ••BD  •C•D
#        2     4     DBCA  D••A  DB••  D•C•
#        3     4     DBCA  D••A  DB••  D•C•
#        4     2     BADC  •AD•  B•D•  ••DC
#        5     2     BADC  •AD•  B•D•  ••DC
#        6     3     CDAB  •DA•  •D•B  CD••
#        7     1     ACBD  A••D  ••BD  •C•D
#        8     3     CDAB  •DA•  •D•B  CD••
#        9     3     CDAB  •DA•  •D•B  CD••
#       10     2     BADC  •AD•  B•D•  ••DC
#       11     4     DBCA  D••A  DB••  D•C•
#       12     3     CDAB  •DA•  •D•B  CD••
#       13     2     BADC  •AD•  B•D•  ••DC
#       14     1     ACBD  A••D  ••BD  •C•D
#       15     4     DBCA  D••A  DB••  D•C•
#       16     1     ACBD  A••D  ••BD  •C•D
#       17     4     DBCA  D••A  DB••  D•C•
#       18     4     DBCA  D••A  DB••  D•C•
#       19     2     BADC  •AD•  B•D•  ••DC
#       20     3     CDAB  •DA•  •D•B  CD••
#       21     3     CDAB  •DA•  •D•B  CD••
#       22     1     ACBD  A••D  ••BD  •C•D
#       23     2     BADC  •AD•  B•D•  ••DC
#       24     1     ACBD  A••D  ••BD  •C•D
#       25     3     CDAB  •DA•  •D•B  CD••
#       26     2     BADC  •AD•  B•D•  ••DC
#       27     1     ACBD  A••D  ••BD  •C•D
#       28     4     DBCA  D••A  DB••  D•C•
# All extracted IBDs are balanced.
# Randomized                      : 2022-07-11 12:42:25 CEST
# Seeed                           : 3965897
# Calls until balance was achieved: 1

In this example we have seven subjects in all sequences, i.e., the design is balanced. This will not be so if the sample size is not a multiple of the number of sequences. However, that’s not relevant because we aim at balance of the IBDs. If you desire balance also for carryover,25 specify bal <- TRUE, which will require more subjects in such a case.

Since we are working in a regulated environment, perhaps you will be asked by an inspector how you performed the randomization. You need to provide the sample size, all sequences and the seed to reproduce it.

library(randomizeBE)
RL4(n = 28, seqs = c("ACBD", "BADC", "CDAB", "DBCA"), seed = 3965897)

Although there is just one Williams’ design for three treatments, there are six for four treatments, twelve for five, 120 for six, and 360 for seven. The function williams() of randomizeBE will arbitrarily select one of them. Hence, if you run the script, the chance to get the same sequences like in the example is 1/120.

    
For expe🇷ts

Apart from the regulatory recommendations17 18 19 20 21 22 28 I suggest this approach, because the ‘All at Once’ approach may lead to biased estimates and an inflated Type I Error.5 29

All at Once

In this approach we assume homogenicity in the ANOVA of pooled data (equal variances in the pairwise comparisons) and obtain one residual variance.

To plan for this approach specify one of the design arguments given above, i.e., "3x3", "3x6x3", or "4x4".

Whilst popular in the past (and hence, used in most of the examples), I don’t recommend it. Note that more than four treat­ments are not implemented in sampleN.TOST(). Consider the ‘Two at a Time’ approach instead (see the examples below).

top of section ↩︎ previous section ↩︎

    

A Simple Case

We assume a CV of 0.25, a T/R-ratio of 0.95, and target a power of 0.80 in a 6-sequence 3-period Williams’ design.

Since theta0 = 0.95 and targetpower = 0.80 are defaults of the function, we don’t have to give them explicitly. As usual in BE, alpha = 0.05 is employed (we will assess the study by a \(\small{100\,(1-2\,\alpha)=90\%}\) confidence interval). The BE-limits are theta1 = 0.80 and theta2 = 1.25. Since they are also defaults of the function, we don’t have to give them as well. Hence, we need to specify only the CV and the design.

sampleN.TOST(CV = 0.25, design = "3x6x3") # All at Once
# 
# +++++++++++ Equivalence test - TOST +++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 3x6x3 crossover 
# log-transformed data (multiplicative model)
# 
# alpha = 0.05, target power = 0.8
# BE margins = 0.8 ... 1.25 
# True ratio = 0.95,  CV = 0.25
# 
# Sample size (total)
#  n     power
# 30   0.843006
sampleN.TOST(CV = 0.25, design = "2x2")   # Two at a Time
# 
# +++++++++++ Equivalence test - TOST +++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover 
# log-transformed data (multiplicative model)
# 
# alpha = 0.05, target power = 0.8
# BE margins = 0.8 ... 1.25 
# True ratio = 0.95,  CV = 0.25
# 
# Sample size (total)
#  n     power
# 28   0.807439

Don’t get confused by Study design: 2x2 crossover in the output. It only means that we will evaluate the study by the ‘Two at a Time’ approach as IBDs. In the ‘All at Once’ approach we have more degrees of freedom (\(\small{2\times30-4=56}\)) than in the ‘Two at a Time’ approach (\(\small{28-2=26}\)). However, this benefit is compensated by the fact that we need a multiple of six in the William’s design and only an even number in the ‘Two at a Time’ approach. An example to get the correct sample size is given in the Interlude further down.

Sometimes we are not interested in the entire output and want to use only a part of the results in subsequent calculations. We can suppress the output by the argument print = FALSE and assign the result to a data frame (here named x).

x <- sampleN.TOST(CV = 0.25, design = "3x6x3",
                  print = FALSE)

Although you could access the elements by the number of the column(s), I don’t recommend that, since in various functions these numbers are different and hence, difficult to remember.

Let’s retrieve the column names of x:

names(x)
# [1] "Design"         "alpha"          "CV"            
# [4] "theta0"         "theta1"         "theta2"        
# [7] "Sample size"    "Achieved power" "Target power"

Now we can access the elements of x by their names. Note that double square brackets [[…]] have to be used.

x[["Sample size"]]
# [1] 30
x[["Achieved power"]]
# [1] 0.8430065

Although you could access the elements by the number of the column(s), I don’t recommend that, since in various other functions of PowerTOST these numbers are different and hence, difficult to remember. If you insist in accessing elements by column-numbers, use single square brackets […].

x[7:8]
#   Sample size Achieved power
# 1          30      0.8430065

With 30 subjects (five per sequence) we achieve the power we desire.

What happens if we have three dropouts?

power.TOST(CV = 0.25, design = "3x6x3",
           n = x[["Sample size"]] - 3)
# Unbalanced design. n(i)=5/5/5/4/4/4 assumed.
# [1] 0.7986552

Slightly below the 0.80 we desire.


    

Interlude 1

As we have seen above, if we plan to evaluate the study by the ‘Two at a Time’ approach, we will get an even number of subjects from sampleN.TOST(design = "2x2"). If we want balanced sequences in our design (which will be a multiple of three in the "3x3" design, a multiple of six in the "3x6x3" design, and a multiple of four in the "4x4" design), sometimes we have to round the sample size up.

get.sequences <- function(design) {
  known <- known.designs()[, c(2, 5)]
  ns    <- as.integer(known[known$design == design,
                            "steps"])
  return(ns)
}
make.equal <- function(n, ns) {
  return(as.integer(ns * (n %/% ns + as.logical(n %% ns))))
}
CV     <- 0.25
design <- "3x6x3"
ns     <- get.sequences(design)
n.TaT  <- sampleN.TOST(CV = CV, design = "2x2x2",
                       print = FALSE)[["Sample size"]]
n.AaO  <- sampleN.TOST(CV = CV, design = design,
                       print = FALSE)[["Sample size"]]
n      <- make.equal(n = n.TaT, ns = ns)
df     <- data.frame(design = design,
                     approach = c("\u2018All at Once\u2019",
                                  "\u2018Two at a Time\u2019"),
                     n = c(n.AaO, n), power = NA)
df[1, 4] <- power.TOST(CV = CV, design = design,
                       n = n.AaO)
df[2, 4] <- power.TOST(CV = CV, design = "2x2x2",
                       n = n)
print(df, row.names = FALSE)
#  design        approach  n     power
#   3x6x3   ‘All at Once’ 30 0.8430065
#   3x6x3 ‘Two at a Time’ 30 0.8342518

This uprounding from the sample size of the ‘All at Once’ approach protects us also against loss of power due to dropouts.

However, if planning for the ‘Two at a Time’ approach, it is not necessary to have balanced sequences – only the resulting IBDs have to be. Hence, we could plan it as a conventional 2×2 crossover.

design     <- "3x6x3"
CV         <- 0.25
df         <- data.frame(design = design,
                         approach = "\u2018Two at a Time\u2019",
                         n = NA, power = NA)
df[1, 3:4] <- sampleN.TOST(CV = CV, design = "2x2", # for TaT use "2x2"
                       print = FALSE)[7:8]
print(df, row.names = FALSE)
#  design        approach  n     power
#   3x6x3 ‘Two at a Time’ 28 0.8074395

Since dropouts are common, it makes sense to include / dose more subjects in order to end up with a number of eligible subjects which is not lower than our initial estimate.

Let us explore that in the next section.

top of section ↩︎ previous section ↩︎

    

Dropouts

We define two supportive functions:

  1. Provide equally sized sequences, i.e., any total sample size n will be rounded up to the next multiple of the number of sequences ns.
make.equal <- function(n, ns) {
  return(as.integer(ns * (n %/% ns + as.logical(n %% ns))))
}
  1. Provide the adjusted sample size based on the original sample size n, the anticipated droput-rate do.rate, and the number of sequences ns.
nadj <- function(n, do.rate, ns) {
  return(as.integer(make.equal(n / (1 - do.rate), ns)))
}
    

In order to come up with a suggestion we have to anticipate a (realistic!) dropout rate. Note that this not the job of the statistician; ask the Principal Investigator.

It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.
Arthur Conan Doyle (1892)30

Dropout-rate

    

The dropout-rate is calculated from the eligible and dosed subjects
or simply \[\begin{equation}\tag{3} do.rate=1-n_\textrm{eligible}/n_\textrm{dosed} \end{equation}\] Of course, we know it only after the study was performed.

By substituting \(n_\textrm{eligible}\) with the estimated sample size \(n\), providing an anticipated dropout-rate and rearrangement to find the adjusted number of dosed subjects \(n_\textrm{adj}\) we should use \[\begin{equation}\tag{4} n_\textrm{adj}=\;\upharpoonleft n\,/\,(1-do.rate) \end{equation}\] where \(\upharpoonleft\) denotes rounding up to the next even number as implemented in the functions above.

An all too common mistake is to increase the estimated sample size \(n\) by the drop out-rate according to \[\begin{equation}\tag{5} n_\textrm{adj}=\;\upharpoonleft n\times(1+do.rate) \end{equation}\]

If you used \(\small{(5)}\) in the past – you are not alone. In a small survey a whopping 29% of respondents reported to use it.31 Consider changing your routine.

There are no routine statistical questions, only questionable statistical routines.

top of section ↩︎ previous section ↩︎

Adjusted Sample Size

    

In the following I specified more arguments to make the function more flexible.
Note that I wrapped the function power.TOST() in suppressMessages(). Otherwise, the function will throw for any odd sample size a message telling us that the design is unbalanced. Well, we know that.

CV      <- 0.25 # within-subject CV
target  <- 0.80 # target (desired) power
theta0  <- 0.95 # assumed T/R-ratio
theta1  <- 0.80 # lower BE limit
theta2  <- 1.25 # upper BE limit
design  <- "3x6x3"
ns      <- 6
do.rate <- 0.15 # anticipated dropout-rate 10%
df      <- sampleN.TOST(CV = CV, theta0 = theta0,
                        theta1 = theta1,
                        theta2 = theta2,
                        targetpower = target,
                        design = design,
                        print = FALSE)
# calculate the adjusted sample size
n.adj   <- nadj(df[["Sample size"]], do.rate, ns)
# (decreasing) vector of eligible subjects
n.elig  <- n.adj:df[["Sample size"]]
info    <- paste0("Assumed CV              : ",
                  CV,
                  "\nAssumed T/R ratio       : ",
                  theta0,
                  "\nBE limits               : ",
                  paste(theta1, theta2, sep = "\u2026"),
                  "\nTarget (desired) power  : ",
                  target,
                  "\nAnticipated dropout-rate: ",
                  do.rate,
                  "\nEstimated sample size   : ",
                  df[["Sample size"]], " (",
                  df[["Sample size"]]/ns, "/sequence)",
                  "\nAchieved power          : ",
                  signif(df[["Achieved power"]], 4),
                  "\nAdjusted sample size    : ",
                  n.adj, " (",  n.adj/ns, "/sequence)",
                  "\n\n")
# explore the potential outcome for
# an increasing number of dropouts
do.act <- signif((n.adj - n.elig) / n.adj, 4)
df     <- data.frame(dosed    = n.adj,
                     eligible = n.elig,
                     dropouts = n.adj - n.elig,
                     do.act   = do.act,
                     power    = NA)
for (i in 1:nrow(df)) {
  df$power[i] <- suppressMessages(
                   power.TOST(CV = CV,
                              theta0 = theta0,
                              theta1 = theta1,
                              theta2 = theta2,
                              design = design,
                              n = df$eligible[i]))
}
cat(info)
print(round(df, 4), row.names = FALSE)
# Assumed CV              : 0.25
# Assumed T/R ratio       : 0.95
# BE limits               : 0.8…1.25
# Target (desired) power  : 0.8
# Anticipated dropout-rate: 0.15
# Estimated sample size   : 30 (5/sequence)
# Achieved power          : 0.843
# Adjusted sample size    : 36 (6/sequence)
# 
#  dosed eligible dropouts do.act  power
#     36       36        0 0.0000 0.8997
#     36       35        1 0.0278 0.8906
#     36       34        2 0.0556 0.8814
#     36       33        3 0.0833 0.8720
#     36       32        4 0.1111 0.8624
#     36       31        5 0.1389 0.8528
#     36       30        6 0.1667 0.8430

In the worst case (six dropouts) we end up with the originally estimated sample size of 30. Power preserved, mission accomplished. If we have less dropouts, splendid – we gain power.

If we would have adjusted the sample size acc. to \(\small{(5)}\) we would have dosed also 36 subjects like the 36 acc. to \(\small{(4)}\).
Cave: This might not always be the case… If the anticipated dropout rate of 15% is realized in the study, we would have also 30 eligible subjects (power 0.843). In this example we achieve still more than our target power but the loss might be relevant in other cases.

top of section ↩︎ previous section ↩︎

Post hoc Power

    

As said in the preliminaries, calculating post hoc power is futile.

There is simple intuition behind results like these: If my car made it to the top of the hill, then it is powerful enough to climb that hill; if it didn’t, then it obviously isn’t powerful enough. Retrospective power is an obvious answer to a rather uninteresting question. A more meaningful question is to ask whether the car is powerful enough to climb a particular hill never climbed before; or whether a different car can climb that new hill. Such questions are prospective, not retrospective.
Russell V. Lenth (2000)32

However, sometimes we are interested in it for planning the next study.

If you give a total sample size n which is not a multiple of sequences, power.TOST() will try to keep sequences as balanced as possible and show in a message how that was done.

n.act <- 32
signif(power.TOST(CV = 0.25, design = "3x6x3",
                  n = n.act), 6)
# Unbalanced design. n(i)=6/6/5/5/5/5 assumed.
# [1] 0.862425

Say, our study was more unbalanced. Let us assume that we dosed 36 subjects, the total number of subjects was also 32 but all drop­outs occured in one sequence (unlikely but possible).
Instead of the total sample size n we can give the number of subjects of each sequence as a vector (the order is not relevant, i.e., it does not matter which element refers to which sequence).

n.adj    <- head(df$dosed, 1)
n.eli    <- rep(n.adj / ns, ns)
n.eli[1] <- 2
post.hoc <- signif(power.TOST(CV  = 0.25,
                              n = n.eli,
                              design = design), 6)
sig.dig  <- nchar(as.character(n.adj))
fmt      <- paste0("%", sig.dig, ".0f (%",
                   sig.dig, ".0f dropouts)")
info     <- paste0("Dosed subjects: ",
                   sprintf("%2.0f", n.adj),
                   "\nEligible      : ",
                   sprintf(fmt, n.act,
                   n.adj - n.act))
for (i in seq_along(n.eli)) {
  info <- paste(info, "\n  Sequence ", i, ":",
                sprintf(fmt, n.eli[i],
                n.adj / ns - n.eli[i]))
}
info     <- paste(info, "\nPower         : ",
                  post.hoc, "\n")
cat(info)
# Dosed subjects: 36
# Eligible      : 32 ( 4 dropouts) 
#   Sequence  1 :  2 ( 4 dropouts) 
#   Sequence  2 :  6 ( 0 dropouts) 
#   Sequence  3 :  6 ( 0 dropouts) 
#   Sequence  4 :  6 ( 0 dropouts) 
#   Sequence  5 :  6 ( 0 dropouts) 
#   Sequence  6 :  6 ( 0 dropouts) 
# Power         :  0.805284

Of course, in a particular study you will provide the numbers in the n vector directly.

top of section ↩︎ previous section ↩︎

Lost in Assumptions

    

As stated already in the Introduction, the CV and the T/R-ratio are only assumptions. Whatever their origin might be (literature, previous studies) they carry some degree of uncertainty. Hence, believing33 that they are the true ones may be risky.
Some statisticians call that the ‘Carved-in-Stone’ approach.

Say, we performed a pilot study in 12 subjects and estimated the CV as 0.25.

The \(\alpha\) confidence interval of the CV is obtained via the \(\small{\chi^2}\)-distribution of its error variance \(\small{\sigma^2}\) with \(\small{n-2}\) degrees of freedom. \[\eqalign{\tag{6} s_\text{w}^2&=\log_{e}(CV_\text{w}^2+1)\\ L=\frac{(n-1)\,s_\text{w}^2}{\chi_{\alpha/2,\,n-2}^{2}}&\leq\sigma_\text{w}^2\leq\frac{(n-1)\,s_\text{w}^2}{\chi_{1-\alpha/2,\,n-2}^{2}}=U\\ \left\{lower\;CL,\;upper\;CL\right\}&=\left\{\sqrt{\exp(L)-1},\sqrt{\exp(U)-1}\right\} }\] Let’s calculate the 95% confidence interval of the CV to get an idea.

m  <- 12 # pilot study
ci <- CVCL(CV = 0.25, df = m - 2,
           side = "2-sided", alpha = 0.05)
signif(ci, 4)
# lower CL upper CL 
#   0.1733   0.4531

Surprised? Although 0.25 is the best estimate for planning the next study, there is no guarantee that we will get exactly the same outcome. Since the \(\small{\chi^2}\)-distribution is skewed to the right, it is more likely to face a higher CV than a lower one in the pivotal study.

If we plan the study based on 0.25, we would opt for 30 subjects like in the examples before (not adjusted for the dropout-rate).
If the CV will be lower, we gain power. Fine.
But what if it will be higher? Of course, we will loose power. But how much?

Let’s explore what might happen at the confidence limits of the CV.

m    <- 12
ci   <- CVCL(CV = 0.25, df = m - 2,
             side = "2-sided", alpha = 0.05)
n    <- 30
comp <- data.frame(CV = c(ci[["lower CL"]], 0.25,
                          ci[["upper CL"]]),
                   power = NA)
for (i in 1:nrow(comp)) {# evaluation by TaT
  comp$power[i] <- power.TOST(CV = comp$CV[i],
                              design = "2x2x2",
                              n = n)
}
comp[, 1] <- signif(comp[, 1], 4)
comp[, 2] <- signif(comp[, 2], 6)
print(comp, row.names = FALSE)
#      CV    power
#  0.1733 0.983327
#  0.2500 0.834252
#  0.4531 0.225243

Ouch! Note that any power < 0.5 is a failure.

What can we do? The larger the previous study was, the larger the degrees of freedom and hence, the narrower the confidence interval of the CV. In simple terms: The estimate is more certain. On the other hand, it also means that very small pilot studies are practically useless.

m               <- seq(6, 18, ns)
df              <- data.frame(n = m, CV = 0.25,
                              l = NA, u = NA)
for (i in 1:nrow(df)) {
  df[i, 3:4] <- CVCL(CV = 0.25,
                     df = m[i] - 2,
                     side = "2-sided",
                     alpha = 0.05)
}
df[, 3:4]      <- signif(df[, 3:4], 4)
names(df)[3:4] <- c("lower CL", "upper CL")
print(df, row.names = FALSE)
#   n   CV lower CL upper CL
#   6 0.25   0.1483   0.8060
#  12 0.25   0.1733   0.4531
#  18 0.25   0.1849   0.3883
    

A Bayesian method is implemented in PowerTOST, which takes the uncertainty of estimates into account. We can explore the uncertainty of the CV or the T/R-ratio and of both. In the examples we plan for evaluation by the ‘Two at a Time’ approach and hence, specify design = "2x2x2".

m      <- 12   # sample size of pilot
CV     <- 0.25
theta0 <- 0.95
df     <- expsampleN.TOST(CV = CV, theta0 = theta0,
                          targetpower = 0.80,
                          design = "2x2x2",
                          prior.parm = list(
                            m = m, design = "3x6x3"),
                          prior.type = "CV",
                          details = FALSE)
# 
# ++++++++++++ Equivalence test - TOST ++++++++++++
#        Sample size est. with uncertain CV
# -------------------------------------------------
# Study design:  2x2 crossover 
# log-transformed data (multiplicative model)
# 
# alpha = 0.05, target power = 0.8
# BE margins = 0.8 ... 1.25 
# Ratio = 0.95
# CV = 0.25 with 20 df
# 
# Sample size (ntotal)
#  n   exp. power
# 32   0.819115

Not that bad. 7% more subjects required.

What about an uncertain T/R-ratio?

m      <- 12
CV     <- 0.25
theta0 <- 0.95
df     <- expsampleN.TOST(CV = CV, theta0 = theta0,
                          targetpower = 0.80,
                          design = "2x2x2",
                          prior.parm = list(
                            m = m, design = "3x6x3"),
                          prior.type = "theta0",
                          details = FALSE)
# 
# ++++++++++++ Equivalence test - TOST ++++++++++++
#      Sample size est. with uncertain theta0
# -------------------------------------------------
# Study design:  2x2 crossover 
# log-transformed data (multiplicative model)
# 
# alpha = 0.05, target power = 0.8
# BE margins = 0.8 ... 1.25 
# Ratio = 0.95
# CV = 0.25
# 
# Sample size (ntotal)
#  n   exp. power
# 72   0.800974

Terrible! What a nightmare.

That should not take us by surprise. We don’t know where the true T/R-ratio lies but we can calculate the lower 95% confidence limit of the pilot study’s point estimate to get an idea about a worst case.

m      <- 12
CV     <- 0.25
pe     <- 0.95
ci     <- round(CI.BE(CV = CV, pe = 0.95,
                      n = m, design = "2x2x2"), 4)
if (pe <= 1) {
  cl <- ci[["lower"]]
} else {
  cl <- ci[["upper"]]
}
print(cl)
# [1] 0.7918

Exlore the impact of a relatively 5% higher CV and a relatively 5% lower T/R-ratio on power for a given sample size.

n      <- 30
CV     <- 0.25
theta0 <- 0.95
comp1  <- data.frame(CV = c(CV, CV*1.05),
                     power = NA)
comp2  <- data.frame(theta0 = c(theta0, theta0*0.95),
                     power = NA)
for (i in 1:2) {
  comp1$power[i] <- power.TOST(CV = comp1$CV[i],
                               theta0 = theta0,
                               design = "2x2x2",
                               n = n)
}
comp1$power <- signif(comp1$power, 6)
for (i in 1:2) {
  comp2$power[i] <- power.TOST(CV = CV,
                               theta0 = comp2$theta0[i],
                               design = "2x2x2",
                               n = n)
}
comp2$power <- signif(comp2$power, 6)
print(comp1, row.names = FALSE)
print(comp2, row.names = FALSE)
#      CV    power
#  0.2500 0.834252
#  0.2625 0.799072
#  theta0    power
#  0.9500 0.834252
#  0.9025 0.580952

    

Interlude 2

Fig. 1 Power curves for n = 27 (3×6x3 design evaluated ‘All at Once’) for simplicity.

Note the log-scale of the x-axis. It demonstrates that power curves are symmetrical around 1 (\(\small{\log_{e}(1)=0}\), where \(\small{\log_{e}(\theta_2)=\left|\log_{e}(\theta_1)\right|}\)) and we will achieve the same power for \(\small{\theta_0}\) and \(\small{1/\theta_0}\) (e.g., for 0.95 and 1.0526). Furthermore, all curves intersect 0.05 at \(\small{\theta_1}\) and \(\small{\theta_2}\), which means that for a true \(\small{\theta_0}\) at one of the limits \(\small{\alpha}\) is strictly controlled.

<nitpick>
  • A common flaw in protocols is the phrase
          »The sample size was calculated [sic] based on a T/R-ratio of 0.95 – 1.05…«
    If you assume a deviation of 5% of the test from the reference and are not sure about its direction (lower or higher than 1), always use the lower T/R-ratio. If you would use the upper T/R-ratio, power would be only preserved down to 1/1.05 = 0.9524.
    Given, sometimes you will need a higher sample size with the lower T/R-ratio. There’s no free lunch.
CV      <- 0.25
delta   <- 0.05 # direction unknown
theta0s <- c(1 - delta, 1 / (1 + delta),
             1 + delta, 1 / (1 - delta))
n       <- sampleN.TOST(CV = CV, theta0 = 1 - delta,
                        design = "2x2x2",
                        print = FALSE)[["Sample size"]]
comp1   <- data.frame(CV = CV, theta0 = theta0s,
                      base = c(TRUE, rep(FALSE, 3)),
                      n = n, power = NA)
for (i in 1:nrow(comp1)) {
  comp1$power[i] <- power.TOST(CV = CV,
                               theta0 = comp1$theta0[i],
                               n = n, design = "2x2x2")
}
n       <- sampleN.TOST(CV = CV, theta0 = 1 + delta,
                        design = "2x2x2",
                        print = FALSE)[["Sample size"]]
comp2   <- data.frame(CV = CV, theta0 = theta0s,
                      base = c(FALSE, FALSE, TRUE, FALSE),
                      n = n, power = NA)
for (i in 1:nrow(comp2)) {
  comp2$power[i] <- power.TOST(CV = CV,
                               design = "2x2x2",
                               theta0 = comp2$theta0[i],
                               n = n)
}
comp1[, c(2, 5)] <- signif(comp1[, c(2, 5)] , 4)
comp2[, c(2, 5)] <- signif(comp2[, c(2, 5)] , 4)
print(comp1, row.names = FALSE)
print(comp2, row.names = FALSE)
#    CV theta0  base  n  power
#  0.25 0.9500  TRUE 28 0.8074
#  0.25 0.9524 FALSE 28 0.8163
#  0.25 1.0500 FALSE 28 0.8163
#  0.25 1.0530 FALSE 28 0.8074
#    CV theta0  base  n  power
#  0.25 0.9500 FALSE 28 0.8074
#  0.25 0.9524 FALSE 28 0.8163
#  0.25 1.0500  TRUE 28 0.8163
#  0.25 1.0530 FALSE 28 0.8074

</nitpick>


    

Since power is much more sensitive to the T/R-ratio than to the CV, the results obtained with the Bayesian method should be clear now.

Essentially this leads to the murky waters of prospective power analysis, which is covered in another article.
An appetizer to show the maximum deviations (CV, T/R-ratio and decreased sample size due to dropouts) which give still a minimum acceptable power of ≥ 0.70:

CV       <- 0.25
theta0   <- 0.95
target   <- 0.80
minpower <- 0.70
pa       <- pa.ABE(CV = CV, theta0 = theta0,
                   targetpower = target,
                   minpower = minpower,
                   design = "2x2x2")
change.CV     <- 100*(tail(pa$paCV[["CV"]], 1) -
                      pa$plan$CV) / pa$plan$CV
change.theta0 <- 100*(head(pa$paGMR$theta0, 1) -
                      pa$plan$theta0) /
                      pa$plan$theta0
change.n      <- 100*(tail(pa$paN[["N"]], 1) -
                      pa$plan[["Sample size"]]) /
                      pa$plan[["Sample size"]]
comp     <- data.frame(parameter = c("CV", "theta0", "n"),
                       change = c(change.CV,
                                  change.theta0,
                                  change.n))
comp$change <- sprintf("%+.2f%%", comp$change)
names(comp)[2] <- "relative change"
print(pa, plotit = FALSE)
print(comp, row.names = FALSE)
# Sample size plan ABE
#  Design alpha   CV theta0 theta1 theta2 Sample size
#   2x2x2  0.05 0.25   0.95    0.8   1.25          28
#  Achieved power
#       0.8074395
# 
# Power analysis
# CV, theta0 and number of subjects leading to min. acceptable power of ~0.7:
#  CV= 0.2843, theta0= 0.9268
#  n = 23 (power= 0.7173)
# 
#  parameter relative change
#         CV         +13.70%
#     theta0          -2.44%
#          n         -17.86%

Confirms what we have seen above. Interesting that the sample size is the least sensitive one. Many overrate the impact of dropouts on power.


If you didn’t stop reading in desperation (understandable), explore both uncertain CV and T/R-ratio:

m      <- 12
CV     <- 0.25
theta0 <- 0.95
df     <- expsampleN.TOST(CV = CV, theta0 = theta0,
                          targetpower = 0.80,
                          design = "2x2x2",
                          prior.parm = list(
                            m = m, design = "3x6x3"),
                          prior.type = "both",
                          details = FALSE)
# 
# ++++++++++++ Equivalence test - TOST ++++++++++++
#   Sample size est. with uncertain CV and theta0
# -------------------------------------------------
# Study design:  2x2 crossover 
# log-transformed data (multiplicative model)
# 
# alpha = 0.05, target power = 0.8
# BE margins = 0.8 ... 1.25 
# Ratio = 0.95 with 20 df
# CV = 0.25 with 20 df
# 
# Sample size (ntotal)
#  n   exp. power
# 88   0.801273

I don’t suggest that you should use it in practice. AFAIK, not even the main author of this function (Benjamin Lang) does. However, it serves educational purposes to show that it is not that easy and why even properly planned studies might fail.

    

An alternative to assuming a T/R-ratio is based on ‘Statistical Assurance’.34 This concept uses the distribution of T/R-ratios and assumes an uncertainty parameter \(\small{\sigma_\textrm{u}}\). A natural assumption is \(\small{\sigma_\textrm{u}=1-\theta_0}\), i.e., for the commonly applied T/R-ratio of 0.95 one can use the argument sem = 0.05 in the function expsampleN.TOST(), where the argument theta0 has to be fixed at 1.

CV        <- 0.25
theta0    <- 0.95
target    <- 0.80
sigma.u   <- 1 - theta0
comp      <- data.frame(theta0 = theta0,
                        n.1 = NA, power = NA,
                        sigma.u = sigma.u,
                        n.2 = NA, assurance = NA)
comp[2:3] <- sampleN.TOST(CV = CV,
                          targetpower = target,
                          theta0 = theta0,
                          design = "2x2x2",
                          details = FALSE,
                          print = FALSE)[7:8]
comp[5:6] <- expsampleN.TOST(CV = CV,
                             theta0 = 1, # fixed!
                             targetpower = target,
                             design = "2x2x2",
                             prior.type = "theta0",
                             prior.parm =
                               list(sem = sigma.u),
                             details = FALSE,
                             print = FALSE)[9:10]
names(comp)[c(2, 5)]  <- rep("n", 2)
print(signif(comp, 6), row.names = FALSE)
#  theta0  n    power sigma.u  n assurance
#    0.95 28 0.807439    0.05 28  0.817016

top of section ↩︎ previous section ↩︎

    

Multiple Endpoints

It is not unusual that equivalence of more than one endpoint has to be demonstrated. In bioequivalence the pharmacokinetic metrics Cmax and AUC0–t are mandatory (in some jurisdictions like the FDA additionally AUC0–∞).

We don’t have to worry about multiplicity issues (inflated Type I Error) since if all tests must pass at level \(\small{\alpha}\), we are protected by the intersection-union principle.35 36

We design the study always for the worst case combination, i.e., based on the PK metric requiring the largest sample size. In some jurisdictions wider BE limits for Cmax are acceptable. Let’s explore that with different CVs and T/R-ratios.

metrics <- c("Cmax", "AUCt", "AUCinf")
CV      <- c(0.25, 0.19, 0.20)
theta0  <- c(0.95, 1.04, 1.06)
theta1  <- c(0.75, 0.80, 0.80)
theta2  <- 1 / theta1
target  <- 0.80
df      <- data.frame(metric = metrics,
                      theta1 = theta1, theta2 = theta2,
                      CV = CV, theta0 = theta0, n = NA)
for (i in 1:nrow(df)) {# planned for TaT
  df$n[i] <- sampleN.TOST(CV = CV[i],
                          theta0 = theta0[i],
                          theta1 = theta1[i],
                          theta2 = theta2[i],
                          targetpower = target,
                          design = "2x2x2",
                          print = FALSE)[["Sample size"]]
}
df$theta1 <- sprintf("%.4f", df$theta1)
df$theta2 <- sprintf("%.4f", df$theta2)
txt       <- paste0("Sample size based on ",
                     df$metric[df$n == max(df$n)], ".\n")
print(df, row.names = FALSE)
cat(txt)
#  metric theta1 theta2   CV theta0  n
#    Cmax 0.7500 1.3333 0.25   0.95 16
#    AUCt 0.8000 1.2500 0.19   1.04 16
#  AUCinf 0.8000 1.2500 0.20   1.06 20
# Sample size based on AUCinf.

Even if we assume the same T/R-ratio for two PK metrics, we will get a wider margin for the one with lower variability.

Let’s continue with the conditions of our previous examples, this time assuming that the CV and T/R-ratio were applicable for Cmax. As common in PK, the CV of AUC is lower, say, only 0.20. That means, the study is ‘overpowered’ for the assumed T/R-ratio of AUC.

    

Which are the extreme T/R-ratios (largest deviations of T from R) giving still the target power?

opt <- function(x) {
  power.TOST(theta0 = x, CV = df$CV[2],
             theta1 = theta1, theta2 = theta2,
             n = df$n[1], design = "2x2x2") - target
}
metrics <- c("Cmax", "AUC")
CV      <- c(0.25, 0.20) # Cmax, AUC
theta0  <- 0.95          # both metrics
theta1  <- 0.80
theta2  <- 1.25
target  <- 0.80
df      <- data.frame(metric = metrics, theta0 = theta0,
                      CV = CV, n = NA, power = NA)
for (i in 1:nrow(df)) {# planned for TaT
  df[i, 4:5] <- sampleN.TOST(CV = CV[i],
                             theta0 = theta0,
                             theta1 = theta1, theta2 = theta2,
                             targetpower = target,
                             design = "2x2x2", print = FALSE)[7:8]
}
df$power <- signif(df$power, 6)
if (theta0 < 1) {
  res <- uniroot(opt, tol = 1e-8,
                 interval = c(theta1 + 1e-4, theta0))
} else {
  res <- uniroot(opt, tol = 1e-8,
                 interval = c(theta0, theta2 - 1e-4))
}
res     <- unlist(res)
theta0s <- c(res[["root"]], 1/res[["root"]])
txt     <- paste0("Target power for ", metrics[2],
                 " and sample size ",
                 df$n[1], "\nachieved for theta0 ",
                 sprintf("%.4f", theta0s[1]), " or ",
                 sprintf("%.4f", theta0s[2]), ".\n")
print(df, row.names = FALSE)
cat(txt)
#  metric theta0   CV  n    power
#    Cmax   0.95 0.25 28 0.807439
#     AUC   0.95 0.20 20 0.834680
# Target power for AUC and sample size 28
# achieved for theta0 0.9158 or 1.0920.

That means, although we assumed for AUC the same T/R-ratio as for Cmax – with the sample size of 30 required for Cmax – for AUC it can be as low as ~0.92 or as high as ~1.09, which is a soothing side-effect.

\(\small{\widehat{\lambda}_\textrm{z}}\): Estimated apparent elimination rate-constant.

Furthermore, sometimes we have less data of AUC than of Cmax (samples at the end of the profile missing or unreliable \(\small{\widehat{\lambda}_\textrm{z}}\) in some subjects and therefore, less data of AUC0–∞ than of AUC0–t). Again, it will not hurt because for the originally assumed T/R-ratio we need only 20 subjects.

    

Since – as a one-point metric – Cmax is inherently more variable than AUC, Health Canada does not require assessment of its confidence interval – only the point estimate has to lie within 80.0 – 125.0%. We can explore that by setting alpha = 0.5 for it.37

metrics <- c("Cmax", "AUC")
CV      <- c(0.60, 0.20)
theta0  <- 0.95
target  <- 0.80
alpha   <- c(0.50, 0.05)
df      <- data.frame(metric = metrics, CV = CV, theta0 = theta0,
                      alpha = alpha, n = NA)
for (i in 1:nrow(df)) {# planned for TaT
  df$n[i] <- sampleN.TOST(alpha = alpha[i],
                          CV = CV[i],
                          theta0 = theta0,
                          targetpower = target,
                          design = "2x2x2",
                          print = FALSE)[["Sample size"]]
}
txt       <- paste0("Sample size based on ",
                     df$metric[df$n == max(df$n)], ".\n")
print(df, row.names = FALSE)
cat(txt)
#  metric  CV theta0 alpha  n
#    Cmax 0.6   0.95  0.50 24
#     AUC 0.2   0.95  0.05 20
# Sample size based on Cmax.

Only with a relatively high CV (compared to the one of AUC) you may face a situation where you have to base the sample size on Cmax.

top of section ↩︎ previous section ↩︎

    

Multiple Treatments

First of all we have have to consider the purpose of the study and how we intend to evaluate it (see above).

Say, the number of planned pairwise comparisons is \(\small{k}\). Then17

  1. If all \(\small{k}\) tests should pass (AND-conjunction), the intersection-union principle is applicable and hence, we don’t have to adjust \(\small{\alpha}\).21 22 28
  2. If any of the \(\small{k}\) tests should pass (OR-conjunction), we can use the well-known Bon­ferroni-adjustment\[\begin{equation}\tag{7} \alpha_\text{adj}=\alpha\,/\,k \end{equation}\]for simplicity. The overall Type I Error \(\small{TIE}\) will always be controlled because \[\begin{equation}\tag{8} TIE=1-{(1-\alpha_\text{adj})}^k \end{equation}\] The Bonferroni-adjustment is considered the most conservative method dealing with multiplicity. For \(\small{\alpha=0.05}\) and \(\small{k=2}\) we get \(\small{0.049375<\alpha}\). Of course, all comparisons in the study must be done by \(\small{100\,(1-2\,\alpha_\textrm{adj})}\) confidence intervals.

Extending the basic example and using the argument alpha:

k         <- 2        # comparisons
alpha.adj <- 0.05 / k # Bonferroni
sampleN.TOST(alpha = alpha.adj, CV = 0.25,
             theta0 = 0.95, targetpower = 0.80,
             design = "2x2x2") # planned for TaT
# 
# +++++++++++ Equivalence test - TOST +++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover 
# log-transformed data (multiplicative model)
# 
# alpha = 0.025, target power = 0.8
# BE margins = 0.8 ... 1.25 
# True ratio = 0.95,  CV = 0.25
# 
# Sample size (total)
#  n     power
# 36   0.816081

The sample size increased by ~20% from the 30 subjects required in a single comparison.


Interlude: OR-conjunction demystified

Coint Toss © CC-BY-SA 2.0 ICMA Photos

  1. You toss an ideal coin once and place a bet on heads (or tails, doesn’t matter). What’s the chance to win?
    Obviously ½.
  2. You toss it a second time and place the same bet as before. The chance to win this bet is again ½ (coins have a terribly short memory span).
    What is the chance to win at least one of the two bets?
    We can answer this question for any number of tosses based on the cumulative probability of the binomial distribution (hint: Bernoulli process).
prob <- 1 / 2 # Probability to win a single toss
res  <- data.frame(tosses = 1L:4L, success = NA_real_)
for (i in 1:nrow(res)) {
  res$success[i] <- pbinom(i - 1, i, prob)
}
names(res) <- c("toss(es)", "p (win)")
print(res, row.names = FALSE, right = FALSE)
#  toss(es) p (win)
#  1        0.5000 
#  2        0.7500 
#  3        0.8750 
#  4        0.9375

That’s trivial: The more often you try, the higher the chance to win at least one of the bets.

Coming back to BE: The outcome of each pairwise comparison is dichotomous (pass|fail). Therefore, in the OR-con­junction (case b) we have to adjust the level of the test.
    

With an increasing number of comparisons, the Bon­ferroni-adjustment quickly becomes overly conservative. Another issue is that the tests are not strictly independent (at least when we compare different tests to the same reference treatment some – unknown – correlation exists). More powerful methods (Holm, Hoch­berg, …) are out of scope of this article.

The ICH’s draft guideline17 states:

  1. If the objective is to achieve BE for all test formulations versus the comparator product, then no alpha adjustment is needed.
  2. If the objective is to show BE for any of the test formulations, then multiplicity adjustment may be needed.
ICH (2022)17

Comments:

  1. An adjustment is also not needed when comparing one test to comparators from different regions. This may not be apparent from the text, but has recently been clarified.28 38 Of course, the aim is to show BE of the test to all comparators. The FDA expects in an ANDA the confirmatory BE assessment of the relevant test vs. the RLD; however, the data of the respective other comparator(s) should be presented as well.21 22
  2. Perhaps the little word »may« opens the door for more complex and less restrictive (and hence, more powerful) methods, e.g., Holm-Bonferroni, Hoch­berg, hierarchical testing, … However, likely most assessors are not familiar with them. Since we don’t know beforehand which of the test treatments will perform ‘best’, I suggest to estimate the sample size for Bonferroni, gaining power for the others. It is of utmost importance to state the planned procedure unambigously in the protocol.39
    

Three treatments intended for evaluation ‘All at Once’ or ‘Two at a Time’. Equivalence is seeked for both (say, a tablet and a capsule vs one reference or one test vs references from two regions).

CV <- 0.25
df <- data.frame(design = c("3x3", "3x6x3", "2x2x2"),
                 n = NA_integer_, power = NA_real_)
for (i in 1:nrow(df)) {
  df[i, 2:3] <- sampleN.TOST(alpha = 0.05 / 2,
                             CV = CV,
                             design = df$design[i],
                             print = FALSE)[7:8]
}
df$power <- signif(df$power, 4)
print(df, row.names = FALSE)
#  design  n  power
#     3x3 36 0.8278
#   3x6x3 36 0.8278
#   2x2x2 36 0.8161

Four treatments (e.g., Test fasting and fed state, Reference fasting and fed state). Two comparisons are confirmatory (Test vs Reference fasting, Test vs Reference fed) and two are exploratory (Test fed vs fasting, Reference fed vs fasting). We have to adjust only for the former since the latter are only ‘nice to know’.

CV <- 0.25
df <- data.frame(design = c("4x4", "2x2x2"),
                 n = NA_integer_, power = NA_real_)
for (i in 1:nrow(df)) {
  df[i, 2:3] <- sampleN.TOST(alpha = 0.05 / 2,
                             CV = CV,
                             design = df$design[i],
                             print = FALSE)[7:8]
}
df$power <- signif(df$power, 4)
print(df, row.names = FALSE)
#  design  n  power
#     4x4 36 0.8315
#   2x2x2 36 0.8161

top of section ↩︎ previous section ↩︎

    

NTIDs

So far we employed the common (and hence, default) BE-limits theta1 = 0.80 and theta2 = 1.25. In some jurisdictions for Narrow Therapeutic Index Drugs tighter limits of 90.00 – 111.11% have to be used.40

Generally NTIDs show a low within-subject variability (though the between-subject CV can be much higher – these drugs require quite often dose-titration).

You have to provide only the lower BE-limit theta1 (the upper one will be automatically calculated). In the examples I used a low CV of 0.125.

sampleN.TOST(CV = 0.125, theta1 = 0.90,
             design = "2x2x2") # planned for TaT
# 
# +++++++++++ Equivalence test - TOST +++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover 
# log-transformed data (multiplicative model)
# 
# alpha = 0.05, target power = 0.8
# BE margins = 0.9 ... 1.111111 
# True ratio = 0.95,  CV = 0.125
# 
# Sample size (total)
#  n     power
# 68   0.805372

Not so nice. If you would exhaust the capacity of the clinical site, you may consider a replicate design.


Interlude 3

Health Canada requires for NTIDs (termed by HC ‘critical dose drugs’) that the confidence interval of Cmax lies within 80.0 – 125.0% and the one of AUC within 90.0 – 112.0%.

<nitpick>

\(\small{100\frac{1}{0.900}\,{\color{Red}\neq}\,112.0\%}\), right? I always thought that it’s \(\small{111.1\dot{1}\%}\)… Consequently, on the average generic products will be approved with \(\small{\sqrt{90.0\times112.0}\approx100.4\%}\). When asked, the reply was:
»These numbers are more easy to remember.«

</nitpick>

Hence, for Health Canada you have to specify both limits.

sampleN.TOST(CV = 0.125, theta1 = 0.90, theta2 = 1.12,
             design = "2x2x2") # planned for TaT
# 
# +++++++++++ Equivalence test - TOST +++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover 
# log-transformed data (multiplicative model)
# 
# alpha = 0.05, target power = 0.8
# BE margins = 0.9 ... 1.12 
# True ratio = 0.95,  CV = 0.125
# 
# Sample size (total)
#  n     power
# 68   0.805372

Here we get the same sample size than with the common limits for NTIDs.

Sometimes, if evaluated ‘All at Once’ – by the size of one sequence – smaller:

get.sequences <- function(design) {
  known <- known.designs()[, c(2, 5)]
  ns    <- as.integer(known[known$design == design,
                            "steps"])
  return(ns)
}
make.equal <- function(n, ns) {
  return(as.integer(ns * (n %/% ns + as.logical(n %% ns))))
}
design  <- "3x6x3"
target  <- 0.80
theta0  <- 0.975
CV.min  <- 0.075
CV.max  <- 0.20
CV.step <- 1000
CV      <- seq(CV.min, CV.max, length.out = CV.step)
ns      <- get.sequences(design)
res     <- data.frame(CV = CV, n.EMA = NA_integer_,
                      n.HC = NA_integer_, less = FALSE)
for (i in 1:nrow(res)) {
  n.TaT <- sampleN.TOST(CV = CV[i], theta0 = theta0,
                        targetpower = target, design = "2x2x2",
                        theta1 = 0.90, theta2 = 1/0.90,
                        print = FALSE)[["Sample size"]]
  res$n.EMA[i] <- make.equal(n = n.TaT, ns = ns)
  n.TaT <- sampleN.TOST(CV = CV[i], theta0 = theta0,
                        targetpower = target, design = "2x2x2",
                        theta1 = 0.90, theta2 = 1.12,
                        print = FALSE)[["Sample size"]]
  res$n.HC[i]  <- make.equal(n = n.TaT, ns = ns)
  if (res$n.HC[i] < res$n.EMA[i]) res$less[i] = TRUE
}
cat("design                      :", design,
    "(intended for \u2018Two at a Time\u2019 approach)",
    "\ntarget power                :", target,
    "\ntheta0                      :", theta0,
    "\nCV                          :", CV.min, "\u2013", CV.max,
    "\nSample size for HC < common :",
    sprintf("%.2f%%", 100 * sum(res$less) / CV.step),
    "of cases.\n")
# design                      : 3x6x3 (intended for ‘Two at a Time’ approach) 
# target power                : 0.8 
# theta0                      : 0.975 
# CV                          : 0.075 – 0.2 
# Sample size for HC < common : 6.60% of cases.

The FDA requires for NTIDs tighter batch-release specifications (±5% instead of ±10%). Let’s hope that your product complies and the T/R-ratio will be closer to 1:

sampleN.TOST(CV = 0.125,
             theta0 = 0.975,   # ‘better’ T/R-ratio
             theta1 = 0.90,
             design = "2x2x2") # planned for TaT
# 
# +++++++++++ Equivalence test - TOST +++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover 
# log-transformed data (multiplicative model)
# 
# alpha = 0.05, target power = 0.8
# BE margins = 0.9 ... 1.111111 
# True ratio = 0.975,  CV = 0.125
# 
# Sample size (total)
#  n     power
# 32   0.800218

Substantially lower sample and doable.

top of section ↩︎ previous section ↩︎

    

Difference of Means

Sometimes we are interested in assessing differences of responses instead of their ratios. In such a case we have to set logscale = FALSE. The limits theta1 and theta2 can be expressed in the following ways:

  • As a difference of means relative to the same (underlying) reference mean.
  • In units of the difference of means.

Note that in the former case the units of CV, and theta0 need also to be given relative to the reference mean (specified as ratio).

Let’s estimate the sample size for an equivalence trial of two blood pressure lowering drugs assessing the difference in means of untransformed data (raw, linear scale). In this setup everything has to be given with the same units (i.e., here the assumed difference –5 mm Hg and the lower / upper limits –15 mm Hg / +15 mm Hg systolic blood pressure). Furthermore, we assume a CV of 25 mm Hg.

sampleN.TOST(CV = 20, theta0 = -5,
             theta1 = -15, theta2 = +15,
             design = "2x2x2",
             logscale = FALSE) # planned for TaT
# 
# +++++++++++ Equivalence test - TOST +++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover 
# untransformed data (additive model)
# 
# alpha = 0.05, target power = 0.8
# BE margins = -15 ... 15 
# True diff. = -5,  CV = 20
# 
# Sample size (total)
#  n     power
# 52   0.807468

Sometimes in the literature we find not the CV but the standard deviation of the difference. Say, it is given with 36 mm Hg. We have to convert it to a CV. In all implemented Higher-Order Crossover designs: \(\small{CV=SD/\sqrt{2}}\).

SD.delta <- 36
sampleN.TOST(CV = SD.delta / sqrt(2), theta0 = -5,
             theta1 = -15, theta2 = +15,
             design = "2x2x2",
             logscale = FALSE) # planned for TaT
# 
# +++++++++++ Equivalence test - TOST +++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover 
# untransformed data (additive model)
# 
# alpha = 0.05, target power = 0.8
# BE margins = -15 ... 15 
# True diff. = -5,  CV = 25.45584
# 
# Sample size (total)
#  n     power
# 82   0.801691

Note that other software packages (e.g., PASS, nQuery, StudySize, …) require the standard deviation of the difference as input.

previous section ↩︎

    

Ratio of Means

It would be a fundamental flaw if responses are assumed to a follow a normal distribution and not – like above – assess their difference and at the end instead giving an estimate of the difference, calculate the ratio of the test- and reference-means.

In such a case Fieller’s (‘fiducial’) confidence interval41 has to be employed.
Note that in the functions sampleN.RatioF()and power.RatioF() alpha = 0.025 is the default, since it is intended for studies with clinical endpoints, where the 95% confidence interval is usually used for equivalence testing.42 Only the ‘Two at a Time’ approach is implemented, i.e., you have to specify design = "2x2". Sometimes we have to round the sample size up (see the interlude). Note that we need additionally the between-subject CV (here arbitrarily twice the CV).

planned  <- "3x6x3"
CV       <- 0.25
CVb      <- 2 * CV
get.sequences <- function(design) {
  known <- known.designs()[, c(2, 5)]
  ns    <- as.integer(known[known$design == design,
                            "steps"])
  return(ns)
}
make.equal <- function(n, ns) {
  return(as.integer(ns * (n %/% ns + as.logical(n %% ns))))
}
ns <- get.sequences(planned)
n  <- sampleN.RatioF(CV = CV, CVb = CVb, design = "2x2",
                     print = FALSE)[["Sample size"]]
n  <- make.equal(n = n, ns = ns)
df <- data.frame(Design = planned, Eval = "2x2",
                 alpha = 0.025, CV = CV, CVb = CVb, n = n)
df$power <- suppressMessages(
              power.RatioF(CV = CV, CVb = CVb,
                           design = "2x2", n = n))
names(df)[6:7] <- c("Sample size", "Achieved power")
print(df, row.names = FALSE)
#  Design Eval alpha   CV CVb Sample size Achieved power
#   3x6x3  2x2 0.025 0.25 0.5          42      0.8021525

previous section ↩︎

    

Methods

He who seeks for methods without having a definite problem in mind seeks in the most part in vain.

With a few exceptions (i.e., simulation-based methods), in PowerTOST the default method = "exact" implements Owen’s Q function43 which is also used in SAS’ Proc Power.

Other implemented methods are "mvt" (based on the bivariate non-central t-distribution), "noncentral" / "nct" (noncentral t-distribution), and "shifted" / "central" (shifted central t-distribution). Although "mvt" is also exact, it may have a somewhat lower precision compared to Owen’s Q and has a much longer run-time.

Let’s compare them.

CV      <- 0.25
methods <- c("exact", "mvt", "noncentral", "shifted")
appr    <- c("‘Two at a Time’", "‘All at Once’")
df      <- data.frame(approach = rep(appr, each = 4), method = methods,
                      n = NA_integer_, power = NA_real_)
for (i in 1:nrow(df)) {
  if (df$approach[i] == "‘Two at a Time’") {
    df[i, 3:4] <- sampleN.TOST(CV = CV, design = "2x2",
                               method = df$method[i], print = FALSE)[7:8]
}else {
    df[i, 3:4] <- sampleN.TOST(CV = CV, design = "3x6x3",
                               method = df$method[i], print = FALSE)[7:8]
  }
}
df$power <- round(df$power, 5)
print(df, row.names = FALSE, right = FALSE)
#  approach        method     n  power  
#  ‘Two at a Time’ exact      28 0.80744
#  ‘Two at a Time’ mvt        28 0.80744
#  ‘Two at a Time’ noncentral 28 0.80744
#  ‘Two at a Time’ shifted    28 0.80303
#  ‘All at Once’   exact      30 0.84301
#  ‘All at Once’   mvt        30 0.84301
#  ‘All at Once’   noncentral 30 0.84301
#  ‘All at Once’   shifted    30 0.84113

Power approximated by the shifted central t-distribution is generally slightly lower compared to the others. Hence, if used in sample size estimations, occasionally one more complete sequence is ‘required’.
Therefore, I recommend to use "method = shifted" / "method = central" only for comparing with old results (literature, own studies).

previous section ↩︎

Q & A

  • Q: Can we use R in a regulated environment and is PowerTOST validated?
    A: See this document44 about the acceptability of Base R and its SDLC.45
    R is updated every couple of months with documented changes46 and maintaining a bug-tracking system.47 I re­commend to use always the latest release.
    The authors of PowerTOST tried to do their best to provide reliable and valid results. The package’s NEWS documents the development of the package, bug-fixes, and introduction of new methods. Issues are tracked at Git­Hub (as of today none is still open). So far the package had >102,000 downloads. Therefore, it is extremely unlikely that bugs were not detected given its large user base.
    The ultimate re­spon­si­bil­i­ty of validating any software (yes, of SAS as well…) lies in the hands of the user.48 49 50 As an aside, statisticians of the FDA regularly use R themselves…

  • Q: Which of the methods should we use in our daily practice?
    A: method = exact. Full stop. Why rely on approximations? Since it is the default in sampleN.TOST() and power.TOST(), you don’t have to give this argument (saves keystrokes).

  • Q: I fail to understand your example about dropouts. We finish the study with 30 eligible subjects as desired. Why is the dropout-rate ~17% and not the anticipated 10%?
    A: That’s due to rounding up the calculated adjusted sample size (35.29…) to the next complete sequence (36). If you manage it to dose fractional subjects (I can’t) your dropout rate would indeed equal the anticipated one: 100 × (1 – 30/35.29…) = 10%.

  • Q: Do we have to worry about unbalanced sequences?
    A: sampleN.TOST() will always give the total number of subjects for balanced sequences.
    If you are interested in post hoc power, give the sample size as a vector, i.e., power.TOST(..., n = c(foo, bar, ...), where foo, bar, ... are the number of subjects in each sequence.

  • Q: Is it possible to simulate power of studies?
    A: That’s not necessary, since the available methods provide analytical solutions. However, if you don’t trust them, simulations are possible with the function power.TOST.sim(), which employs the distributional properties:

\(\small{\sigma^2}\) follows a \(\small{\chi^2}\)-distribution with \(\small{n-2}\) degrees of freedom and \(\small{\log_{e}(\theta_0)}\) follows a normal distribution).51

Convergence takes a while. Empiric power, its standard error, its relative error compared to the exact power, and execution times on a Xeon E3-1245v3 @ 3.40GHz (1/4 cores) 16GB RAM with R 4.3.1 on Win­dows 7 build 7601, Service Pack 1, Universal C Run­time 10.0.10240.16390; 3×3 Latin Squares evaluated by ‘All at Once’ for simplicity:

CV    <- 0.25
des   <- "3x3"
n     <- sampleN.TOST(CV = CV, design = des,
                      print = FALSE)[["Sample size"]]
nsims <- c(1e5, 5e5, 1e6, 5e6, 1e7, 5e7, 1e8)
exact <- power.TOST(CV =CV, n = n, design = des)
df    <- data.frame(simulations = nsims,
                    exact = rep(exact, length(nsims)),
                    simulated = NA, SE = NA, RE = NA)
for (i in 1:nrow(df)) {
  start.time      <- proc.time()[[3]]
  df$simulated[i] <- power.TOST.sim(CV = CV, n = n,
                                    design = des,
                                    nsims = nsims[i])
  df$secs[i]      <- proc.time()[[3]] - start.time
  df$SE[i]        <- sqrt(0.5 * df$simulated[i] / nsims[i])
  df$RE           <- 100 * (df$simulated - exact) / exact
}
df$exact       <- signif(df$exact, 5)
df$SE          <- formatC(df$SE, format = "f", digits = 5)
df$RE          <- sprintf("%+.4f%%", df$RE)
df$simulated   <- signif(df$simulated, 5)
df$simulations <- formatC(df$simulations, format = "f",
                          digits = 0, big.mark=",")
names(df)[c(1, 3)] <- c("sim\u2019s", "sim\u2019d")
print(df, row.names = FALSE)
#        sim’s   exact   sim’d      SE       RE   secs
#      100,000 0.80349 0.80222 0.00200 -0.1585%   0.12
#      500,000 0.80349 0.80333 0.00090 -0.0209%   0.58
#    1,000,000 0.80349 0.80338 0.00063 -0.0145%   1.19
#    5,000,000 0.80349 0.80353 0.00028 +0.0040%   5.92
#   10,000,000 0.80349 0.80342 0.00020 -0.0094%  11.86
#   50,000,000 0.80349 0.80353 0.00009 +0.0046%  59.28
#  100,000,000 0.80349 0.80350 0.00006 +0.0007% 119.53

With an increasing number of simulations values become more precise (the standard error decreases) as well as more accurate (the relative error decreases).


Fig. 2 Empiric power for n = 28 (3×3 design evaluated ‘All at Once’).

In the Monte Carlo method values approach the exact one asymptotically.

  • Q: I still have questions. How to proceed?
    A: The preferred method is to register at the BEBA Forum and post your question in the category Power / Sample size  (please read the Forum’s Policy first).
    You can contact me at [email protected]. Be warned – I will charge you for anything beyond most basic questions.

previous section ↩︎

Acknowledgment

Member Shuanghe of the BEBA-Forum for discovering a bug in my script for extracting IBDs (they were not necessarily balanced).

Licenses

CC BY 4.0 Helmut Schütz 2023
R and packages GPL 3.0, klippy MIT, pandoc GPL 2.0.
1st version March 17, 2021. Rendered October 1, 2023 17:55 CEST by rmarkdown via pandoc in 0.84 seconds.

Footnotes and References


  1. Labes D, Schütz H, Lang B. PowerTOST: Power and Sample Size for (Bio)Equivalence Studies. Package version 1.5.5. 2023-06-27. CRAN.↩︎

  2. Labes D. randomizeBE: Create a Random List for Crossover Studies. Package version 0.3.5. 2019-08-24. CRAN.↩︎

  3. Schütz H. Average Bioequivalence. 2022-02-21. CRAN.↩︎

  4. Labes D, Schütz H, Lang B. Package ‘PowerTOST’. February 21, 2022. CRAN.↩︎

  5. Schuirmann DJ. Two at a Time? Or All at Once? International Biometric Society, Eastern North American Region, Spring Meeting. Pittsburgh, PA. March 28 – 31, 2004. Abstract.↩︎

  6. Senn S. Guernsey McPearson’s Drug Development Dictionary. 21 April 2020. Online.↩︎

  7. Hoenig JM, Heisey DM. The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. Am Stat. 2001; 55(1): 19–24. doi:10.1198/000313001300339897. Open Access Open Access.↩︎

  8. In short: There is no statistical method to ‘correct’ for unequal carryover. It can only be avoided by design, i.e., a sufficiently long washout between periods. According to the guidelines subjects with pre-dose concentrations > 5% of their Cmax can by excluded from the comparison if stated in the protocol.↩︎

  9. Especially important for drugs which are auto-inducers or -inhibitors and biologics.↩︎

  10. Senn S. Statistical Issues in Drug Development. Wiley, 2nd ed 2007.↩︎

  11. Julious SA. Sample Sizes for Clinical Trials. Chapter 7.2. Boca Raton: CRC Press; 2010.↩︎

  12. Common batch release specifications are ±10% of the declared content. Of course, you will try to find a batch of the reference product which is as close as possible to the test (the EMA recommends ≤ 5%). How­ever, keep analytical (in)accuracy and (im)precision in mind. Furthermore, the analytical method was only validated for the test. Try to get a CoA from the originator… Good luck. You never can be sure.↩︎

  13. Senn S. Statistical Issues in Drug Development. Chichester: John Wiley; 2004.↩︎

  14. Zhang P. A Simple Formula for Sample Size Calculation in Equivalence Studies. J Biopharm Stat. 2003; 13(3): 529–38. doi:10.1081/BIP-120022772.↩︎

  15. Let us assume that the – whilst unknown – within-subject \(\small{CV\textrm{s}}\) of the two reference treatments are different \((\small{CV_{\textrm{w}_{\textrm{R}1}}\neq CV_{\textrm{w}_{\textrm{R}2}}\,}\)). Since in the ‘All at Once’ approach the pooled variance is used, the confidence interval of the treatment comparisons may be wider than in the ‘Two at a Time’ approach.↩︎

  16. Since e.g., in the comparison of products from different regions the populations differ and the risk for both is still \(\small{\leq \alpha}\) – apart tourists buying the product in another region…↩︎

  17. ICH. Bioequivalence for Immediate Release Solid Oral Dosage Forms. M13A – Draft. 20 De­cem­ber 2022. Online.↩︎

  18. EMA, CHMP. Guideline on the Investigation of Bioequivalence. London. 20 January 2010. Online.↩︎

  19. David Brown. Presentation at the 3rd EGA Symposium on Bioequivalence. London. June 2010. Slides.↩︎

  20. EGA. Revised EMA Bioequivalence Guideline. Questions & Answers. Online.↩︎

  21. FDA, CDER. Draft Guidance for Industry. Statistical Approaches to Establishing Bioequivalence. Silver Spring. December 2022. Revision 1. Download.↩︎

  22. FDA (CDER, OGD). Navigating the First ICH Generic Drug Draft Guideline “M13A Bioequivalence for Immediate-Release Solid Oral Dosage Forms”. SBIA Webinar. May 2, 2023. Online.↩︎

  23. Cochran WG, Cox GM. Experimental Designs. New York: Wiley; 2nd ed. 1957. Chapter 9. p. 376–95.↩︎

  24. Dean A, Voss D, Dragulić D. Design and Analysis of Experiments. New York: Springer; 2nd ed. 2017. Chapter 11. p. 349–97.↩︎

  25. This effect is not used in the common model. Unequal carryover – which would bias the treatment estimate – can only be avoided by a sufficiently long washout. For details see another artice.↩︎

  26. Council of the EEC. On the approval of the rules for conducting research of bioequivalence of drugs within the framework of the EEU. November 3, 2016; amended September 4, 2020.↩︎

  27. Mandal BN. ibd: Incomplete Block Designs. Package version 1.5. 2019-04-24. CRAN.↩︎

  28. Medicines for Europe. 2nd Bioequivalence Workshop. Session 2 – ICH M13 – Bioequivalence for IR solid oral dosage forms. Brussels. 26 April 2023.↩︎

  29. D’Angelo P. Testing for Bioequivalence in Higher‐Order Crossover Designs: Two‐at‐a‐Time Principle Versus Pooled ANOVA. 2nd Conference of the Global Bioequivalence Harmonisation Initiative. Rockville, MD. 15–16 September, 2016.↩︎

  30. Doyle AC. The Adventures of Sherlock Holmes. A Scandal in Bohemia. 1892. p. 3.↩︎

  31. Schütz H. Sample Size Estimation in Bioequivalence. Evaluation. BEBA Forum. 2020-10-23.↩︎

  32. Lenth RV. Two Sample-Size Practices that I Don’t Recommend. 24 October 2000. Online.↩︎

  33. Quoting my late father: »If you believe, go to church.«↩︎

  34. Ring A, Lang B, Kazaroho C, Labes D, Schall R, Schütz H. Sample size determination in bioequivalence studies using statistical assurance. Br J Clin Pharmacol. 2019; 85(10): 2369–77. doi:10.1111/bcp.14055.  Open Access.↩︎

  35. Berger RL, Hsu JC. Bioequivalence Trials, Intersection-Union Tests and Equivalence Confidence Sets. Stat Sci. 1996; 11(4): 283–302. JSTOR:2246021.↩︎

  36. Zeng A. The TOST confidence intervals and the coverage probabilities with R simulation. March 14, 2014. Online.↩︎

  37. With \(\small{\alpha=0.5}\) the width of the CI is zero and the test effectively reduces to a pass|fail assessment.↩︎

  38. Since different patient populations are concerned, the Type I Error is not affected.↩︎

  39. EMEA, CHMP. Guideline on multiplicity issues in clinical trials – Draft. London. 15 December 2016. Online.↩︎

  40. The U.S. FDA and China’s CDE recommend Reference-Scaled Average Bioequivalence (RSABE) and a comparison of the within-subject variances of treatments, which requires a fully replicated design. It is covered in another article.↩︎

  41. Fieller EC. Some Problems In Interval Estimation. J Royal Stat Soc B. 1954; 16(2): 175–85. JSTOR:2984043.↩︎

  42. EMEA, CPMP. Points to Consider on Switching between Superiority and Non-Inferiority. London. 27 July 2000. Online.↩︎

  43. Owen DB. A special case of a bivariate non-central t-distribution. Biometrika. 1965; 52(3/4): 437–46. doi:10.2307/2333696.↩︎

  44. The R Foundation for Statistical Computing. A Guidance Document for the Use of R in Regulated Clinical Trial En­vi­ron­ments. Vienna. October 18, 2021. Online.↩︎

  45. The R Foundation for Statistical Computing. R: Software Development Life Cycle. A Description of R’s Development, Testing, Release and Maintenance Processes. Vienna. October 18, 2021. Online.↩︎

  46. The R Foundation. R News. 2023-06-16. Online.↩︎

  47. Bugzilla. R bug tracking system. Online.↩︎

  48. FDA. Statistical Software Clarifying Statement. May 6, 2015. Download.↩︎

  49. WHO. Guidance for organizations performing in vivo bioequivalence studies. Geneva. May 2016. Technical Report Series No. 996, Annex 9. Section 4. Online.↩︎

  50. ICH. Good Clinical Practice (GCP). E6(R3) – Draft. 19 May 2023. Section 4.5. Online.↩︎

  51. Zheng C, Wang J, Zhao L. Testing bioequivalence for multiple formulations with power and sample size calculations. Pharm Stat. 2012; 11(4): 334–41. doi:10.1002/pst.1522.↩︎