Consider allowing JavaScript. Otherwise, you have to be proficient in reading since formulas will not be rendered. Furthermore, the table of contents in the left column for navigation will not be available and code-folding not supported. Sorry for the inconvenience.

Examples in this article were generated with  4.2.1 by the package PowerTOST.1

• The right-hand badges give the ‘level’ of the respective section.

1. Basics about sample size methodology and study designs – requiring no or only limited statistical expertise.

1. These sections are the most important ones. They are – hopefully – easily comprehensible even for novices.

1. A somewhat higher knowledge of statistics and/or R is required. May be skipped or reserved for a later reading.

1. An advanced knowledge of statistics and/or R is required. Not recommended for beginners in particular.
• Click to show / hide R code.
Abbreviation Meaning
$$\small{\alpha}$$ Level of the test
BE Bioequivalence
CI Confidence Interval
$$\small{CV}$$ Within-subject Coefficient of Variation (paired, crossover, replicate designs),
total CV (parallel design)
$$\small{\Delta}$$ Clinically relevant difference
$$\small{H_0}$$ Null hypothesis
$$\small{H_1}$$ Alternative hypothesis (also $$\small{H_\textrm{a}}$$)
$$\small{\theta_0=\mu_\textrm{T}/\mu_\textrm{R}}$$ True T/R-ratio
$$\small{\theta_1,\;\theta_2}$$ Lower and upper limits of the acceptance range (BE margins)

# Introduction

What is the best approach to deal with the limited capacity of a clinical site?

The most simple – and preferable – approach is to find a clinical site which is able to accommodate all subjects at once. If this is not possible, subjects could be allocated to multiple groups (aka cohorts) or sites. Whether or not a group- (site-) term should by included in the statistical model is still the topic of heated debates. In the case of multi-site studies regulators likely require a modification of the model.2

Since in replicate designs less subjects are required to achieve the same power than in conventional 2×2×2 crossover design, sometimes multi-group or multi-site studies can be avoided.

# Preliminaries

A basic knowledge of R is required. To run the scripts at least version 1.4.8 (2019-08-29) of PowerTOST is suggested. Any version of R would likely do, though the current release of PowerTOST was only tested with version 4.1.3 (2022-03-10) and later.
All scripts were run on a Xeon E3-1245v3 @ 3.40GHz (1/4 cores) 16GB RAM with R 4.2.1 on Windows 7 build 7601, Service Pack 1, Universal C Runtime 10.0.10240.16390.

The examples deal primarily with the 2×2×2 crossover design ($$\small{\textrm{TR}|\textrm{RT}}$$)3 but the concept is applicable to any kind of cross­over (Higher-Order, replicate designs) or parallel designs assessed for equivalence.

In the following I consider groups, i.e., studies performed in a single site. However, the concept is applicable for studies performed in multiple sites as well. Studies in groups in multiple sites are out of scope of this article.

# Hypotheses

The lower and upper limits of the bioequivalence (BE) range $$\small{\{\theta_1,\theta_2\}}$$ are calculated based on the ‘clinically not relevant difference’ $$\small{\Delta}$$ $\left\{\theta_1=100\,(1-\Delta),\,\theta_2=100\,(1-\Delta)^{-1}\right\}\tag{1}$ Commonly $$\small{\Delta}$$ is set to 0.20. For narrow therapeutic index drugs (the EMA and other jurisdictions) $$\small{\Delta}$$ is set to 0.10, and for $$\small{C_\textrm{max}}$$ (Russian Federation, the EEU, South Africa) $$\small{\Delta}$$ is set to 0.25. We obtain: $\begin{matrix}\tag{2} \left\{\theta_1=80.00\%,\,\theta_2=125.00\%\right\}\\ \left\{\theta_1=90.00\%,\,\theta_2=111.1\dot{1}\%\right\}\\ \left\{\theta_1=75.00\%,\,\theta_2=133.3\dot{3}\%\right\} \end{matrix}$

Conventionally BE is assessed by the confidence interval inclusion approach: $H_0:\frac{\mu_\textrm{T}}{\mu_\textrm{R}}\not\subset\left\{\theta_1,\theta_2\right\}\:vs\:H_1:\theta_1<\frac{\mu_\textrm{T}}{\mu_\textrm{R}}<\theta_2\tag{3}$

As long as the $$\small{100\,(1-2\,\alpha)}$$ confidence interval lies entirely within pre-specified limits $$\small{\{\theta_1,\theta_2\}}$$, the null hypothesis of inequivalence in $$\small{(3)}$$ is rejected and the alternative hypothesis of equivalence in $$\small{(3)}$$ is accepted.

# Models

The model for a crossover design in average bioequivalence as stated in guidelines is \eqalign{response&|\;sequence,\,subject(sequence),\\ &\phantom{|}\;period,\,treatment}\tag{4} Most regulations recommend an ANOVA, i.e., all effects fixed. The FDA and Health Canada recommend a mixed-effects model, i.e., to specify $$\small{subject(sequence)}$$ as a random effect and all others fixed.
It must be mentioned that in comparative bioavailability studies subjects are usually uniquely coded. Hence, the nested term $$\small{subject(sequence)}$$ in $$\small{(4)}$$ is a bogus one4 and could be replaced by the simple $$\small{subject}$$ as well. See also another article.

If the study is performed in multiple groups or sites, model $$\small{(4)}$$ can be modified to \eqalign{response&|\;F,\,\;sequence,\,subject(F\times sequence),\\ &\phantom{|}\;period(F),F\times sequence,\,treatment\small{\textsf{,}}}\tag{5} where $$\small{F}$$ is the code for $$\small{group}$$ or $$\small{site}$$, respectively.

When $$\small{N_\text{F}}$$ is the number of groups or sites, in $$\small{(5)}$$ there are $$\small{N_\text{F}-1}$$ less residual degrees of freedom than in $$\small{(4)}$$. The table below gives the designs implemented in PowerTOST.

$\small{\begin{array}{l|lccccrr} \phantom{00000000}\textsf{design} & \phantom{00}\textsf{code} & \textsf{t} & \textsf{s} & \textsf{p} & \textsf{m} & (4)\phantom{0} & (5)\phantom{00000000}\\\hline \textsf{Parallel} & \textsf{parallel} & 2 & - & 1 & 1 & N-2 & N-2-(N_\text{F}-1)\\ \textsf{Paired means} & \textsf{paired} & 2 & - & 2 & 2 & N-1 & N-1-(N_\text{F}-1)\\ \textsf{Crossover} & \textsf{2}\times\textsf{2}\times\textsf{2} & 2 & 2 & 2 & \tfrac{1}{2} & N-2 & N-2-(N_\text{F}-1)\\ \textsf{2-sequence 3-period full replicate} & \textsf{2}\times\textsf{2}\times\textsf{3} & 2 & 2 & 3 & \tfrac{3}{8} & 2N-3 & 2N-3-(N_\text{F}-1)\\ \textsf{2-sequence 4-period full replicate} & \textsf{2}\times\textsf{2}\times\textsf{4} & 2 & 2 & 4 & \tfrac{1}{4} & 3N-4 & 3N-4-(N_\text{F}-1)\\ \textsf{4-sequence 4-period full replicate} & \textsf{2}\times\textsf{4}\times\textsf{4} & 2 & 4 & 4 & \tfrac{1}{16} & 3N-4 & 3N-4-(N_\text{F}-1)\\ \textsf{Partial replicate} & \textsf{2}\times\textsf{3}\times\textsf{3} & 2 & 3 & 3 & \tfrac{1}{6} & 2N-2 & 2N-3-(N_\text{F}-1)\\ \textsf{Balaam}'\textsf{s} & \textsf{2}\times\textsf{4}\times\textsf{2} & 2 & 4 & 2 & \tfrac{1}{2} & N-2 & N-2-(N_\text{F}-1)\\\textsf{Latin Square or Williams}' & \textsf{3}\times\textsf{3} & 3 & 3 & 3 & \tfrac{2}{9} & 2N-4 & 2N-4-(N_\text{F}-1)\\ \textsf{Williams}' & \textsf{3}\times\textsf{6}\times\textsf{3} & 3 & 6 & 3 & \tfrac{1}{18} & 2N-4 & 2N-4-(N_\text{F}-1)\\ \textsf{Latin Square or Williams}' & \textsf{4}\times\textsf{4} & 4 & 4 & 4 & \tfrac{1}{8} & 3N-6 & 3N-6-(N_\text{F}-1)\\ \end{array}}$ $$\small{\textsf{code}}$$ is the design-argument in the functions of PowerTOST. $$\small{\textsf{t, s, p}}$$ are the number of treatments, sequences, and periods, respectively. $$\small{\textsf{m}}$$ is the multiplier in the radix of $$\small{(6)}$$ and $$\small{N}$$ is the total number of subjects, i.e., $$\small{N=\sum_{i=1}^{i=s}n_i}$$.

The back-transformed $$\small{1-2\,\alpha}$$ Confidence Interval (CI) is calculated by $CI=100\,\exp\left(\log_{e}\overline{x}_\text{T}-\log_{e}\overline{x}_\text{R}\mp t_{df,\alpha}\sqrt{m \times\widehat{s^2}\sum_{i=1}^{i=s}\frac{1}{n_i}}\right)\small{\textsf{,}}\tag{6}$ where $$\widehat{s^2}$$ is the residual variance of the model and $$\small{n_i}$$ is the number of subjects in the $$\small{i^\text{th}}$$ of $$\small{s}$$ sequences.

# Examples

library(PowerTOST) # attach it to run the examples

In a 2×2×2 crossover design the residual degrees of freedom in the models are \eqalign{(4):df&=N-2\\ (5):df&=N-2-(N_\text{F}-1)\small{\textsf{,}}}\tag{7} where $$\small{N}$$ is total number of subjects. Therefore, the CI by $$\small{(6)}$$ for given $$\small{N}$$ and $$\small{\widehat{s^2}}$$ by $$\small{(5)}$$ will always be wider (more conservative) than the one by $$\small{(4)}$$ due to the fewer degrees of freedom and hence, larger $$\small{t}$$-value.
However, unless the sample size is not small and the number of groups (or sites) large, the difference in widths of the CIs is generally neglegible.

design <- "2x2x2"
theta0 <- 0.95
CV     <- 0.335
var    <- CV2mse(CV)
n      <- sampleN.TOST(CV = CV, theta0 = theta0, design = design,
print = FALSE)[["Sample size"]]
n1     <- n2 <- n / 2
NF     <- c(2:4, 6, 8)
x      <- data.frame(n = n, NF = NF,
df.4 = n - 2,
CL.lo.4 = NA_real_, CL.hi.4 = NA_real_,
width.4 = NA_real_,
df.5 = n - 2 - (NF - 1),
CL.lo.5 = NA_real_, CL.hi.5 = NA_real_,
width.5 = NA_real_)
for (i in seq_along(NF)) {
x[i, 4:5] <- exp(log(theta0) - log(1) + c(-1, +1) *
qt(1 - 0.05, df = x$df.4[i]) * sqrt(var / 2 * (1 / n1 + 1 / n2))) x[i, 8:9] <- exp(log(theta0) - log(1) + c(-1, +1) * qt(1 - 0.05, df = x$df.5[i]) *
sqrt(var / 2 * (1 / n1 + 1 / n2)))
}
x$width.4 <- x$CL.hi.4 - x$CL.lo.4 x$width.5 <- x$CL.hi.5 - x$CL.lo.5
diffs    <- x$width.5 - x$width.4
names(x)[c(1, 3:5, 7:9)] <- c("N", "df (4)", "CL.lo (4)", "CL.hi (4)",
"df (5)", "CL.lo (5)", "CL.hi (5)")
x[, 4] <- sprintf("%.3f%%", 100 * x[, 4])
x[, 5] <- sprintf("%.3f%%", 100 * x[, 5])
x[, 8] <- sprintf("%.3f%%", 100 * x[, 8])
x[, 9] <- sprintf("%.3f%%", 100 * x[, 9])
print(x[, c(1:5, 7:9)], row.names = FALSE)
cat("Maximum difference in the width of CIs:",
sprintf("%.2g%%.\n", 100 * max(diffs)))
#   N NF df (4) CL.lo (4) CL.hi (4) df (5) CL.lo (5) CL.hi (5)
#  48  2     46   84.955%  106.232%     45   84.951%  106.238%
#  48  3     46   84.955%  106.232%     44   84.946%  106.243%
#  48  4     46   84.955%  106.232%     43   84.942%  106.249%
#  48  6     46   84.955%  106.232%     41   84.932%  106.262%
#  48  8     46   84.955%  106.232%     39   84.920%  106.276%
# Maximum difference in the width of CIs: 0.079%.

If we have a study with 48 subjects in a 2×2×2 crossover design performed in two to eight groups (or sites), the difference in the widths of CIs is negligible in any case.

In a two-sequence four-period full replicate design the residual degrees of freedom in the models are \eqalign{(4):df&=3N-4\\ (5):df&=3N-4-(N_\text{F}-1)\small{\textsf{,}}}\tag{8}where $$\small{N}$$ is total number of subjects. We need only half of the subjects because power depends on the number of treatments, which is the same like in the 2×2×2 crossover. Again, the CI by $$\small{(6)}$$ for given $$\small{N}$$ and $$\small{\widehat{s^2}}$$ by $$\small{(5)}$$ will always be wider than the one by $$\small{(4)}$$.

design <- "2x2x4"
theta0 <- 0.95
CV     <- 0.335
var    <- CV2mse(CV)
n      <- sampleN.TOST(CV = CV, theta0 = theta0, design = design,
print = FALSE)[["Sample size"]]
n1     <- n2 <- n / 2
NF     <- c(2:4, 6)
x      <- data.frame(n = n, NF = NF,
df.4 = 3 * n - 4,
CL.lo.4 = NA_real_, CL.hi.4 = NA_real_,
width.4 = NA_real_,
df.5 = 3 * n - 4 - (NF - 1),
CL.lo.5 = NA_real_, CL.hi.5 = NA_real_,
width.5 = NA_real_)
for (i in seq_along(NF)) {
x[i, 4:5] <- exp(log(theta0) - log(1) + c(-1, +1) *
qt(1 - 0.05, df = x$df.4[i]) * sqrt(var / 4 * (1 / n1 + 1 / n2))) x[i, 8:9] <- exp(log(theta0) - log(1) + c(-1, +1) * qt(1 - 0.05, df = x$df.5[i]) *
sqrt(var / 4 * (1 / n1 + 1 / n2)))
}
x$width.4 <- x$CL.hi.4 - x$CL.lo.4 x$width.5 <- x$CL.hi.5 - x$CL.lo.5
diffs     <- x$width.5 - x$width.4
names(x)[c(1, 3:5, 7:9)] <- c("N", "df (4)", "CL.lo (4)", "CL.hi (4)",
"df (5)", "CL.lo (5)", "CL.hi (5)")
x[, 4] <- sprintf("%.3f%%", 100 * x[, 4])
x[, 5] <- sprintf("%.3f%%", 100 * x[, 5])
x[, 8] <- sprintf("%.3f%%", 100 * x[, 8])
x[, 9] <- sprintf("%.3f%%", 100 * x[, 9])
print(x[, c(1:5, 7:9)], row.names = FALSE)
cat("Maximum difference in the width of CIs:",
sprintf("%.2g%%.\n",100 * max(diffs)))
#   N NF df (4) CL.lo (4) CL.hi (4) df (5) CL.lo (5) CL.hi (5)
#  24  2     68   85.018%  106.154%     67   85.016%  106.156%
#  24  3     68   85.018%  106.154%     66   85.014%  106.159%
#  24  4     68   85.018%  106.154%     65   85.012%  106.161%
#  24  6     68   85.018%  106.154%     63   85.008%  106.167%
# Maximum difference in the width of CIs: 0.023%.

Here the difference in the widths of CIs is even smaller than in the 2×2×2 crossover due to the larger degrees of freedom.

The function power.TOST.sds() of PowerTOST supports simulations of $$\small{(5)}$$ of studies in a 2×2×2 crossover, as well as in full and partial replicate designs.

The script supports sample size estimation for Average Bioequivalence and generation of groups based on the clinical capacity, where sequences within groups will be balanced. The estimated sample size can be adjusted based on an anticipated dropout-rate. If this is selected (do.rate > 0) two simulations are performed: One for the adjusted sample size (optimistic: no dropouts) and one for the estimated sample size (pessimistic: dropout-rate realized).
Two options for the generation of groups are supported by the logical argument equal:

1. FALSE: Attempts to generate at least one group with the maximum size of the clinical site (default).
2. TRUE:    Attempts to generate equally sized groups.

Last but not least power by $$\small{(5)}$$ is compared to exact power by $$\small{(4)}$$.
As we have seen in the table above, due to its lower degrees of freedom power of model $$\small{(5)}$$ should always be lower than the one of model $$\small{(4)}$$. If you get a positive change value in the comparison, it is a simulation artifact. In such a case, increase the number of simulations (nsims = 1e6 or higher). Cave: 168 LOC.

sim.groups <- function(CV, theta0 = 0.95, theta1, theta2, target = 0.80,
design = "2x2x2", capacity, equal = FALSE,
gmodel = 2, do.rate = 0, nsims = 1e5L, show = TRUE) {
##########################################################
# Explore the impact on power of a group model com-      #
# pared to the conventional model of pooled data via     #
# simulations.                                           #
# ------------------------------------------------------ #
# capacity: maximum capacity of the clinical site        #
# equal:    TRUE : tries to generate equally sized       #
#                  groups                                #
#           FALSE: tries to get at least one group with  #
#                  the maximum size of the clinical site #
#                  (default)                             #
# do.rate:  anticipated dropout-rate; if > 0 a second    #
#           simulation is performed based on the adjust- #
#           ed sample size                               #
##########################################################
if (missing(theta1) & missing(theta2)) theta1 <- 0.80
if (missing(theta1)) theta1 <- 1 / theta2
if (missing(theta2)) theta2 <- 1 / theta1
if (theta0 < theta1 | theta0 > theta2)
stop("theta0 must be within {theta1, theta2}.", call. = FALSE)
if (theta0 == theta1 | theta0 == theta2)
stop("Simulation of the Type I Error not supported.",
"\n       Use power.TOST.sds() directly.", call. = FALSE)
if (!design %in% c("2x2", "2x2x2", "2x3x3", "2x2x4", "2x2x3"))
stop("Design \"", design, "\" not implemented.", call. = FALSE)
ns <- as.integer(substr(design, 3, 3))
make.equal <- function(n, ns) {
# make equally sized sequences
return(as.integer(ns * (n %/% ns + as.logical(n %% ns))))
}
nadj <- function(n, do.rate, ns) {
return(as.integer(make.equal(n / (1 - do.rate), ns)))
}
grouping <- function(capacity, n, design, equal, do.rate) {
# based on the sample size and capacity, calculate the
# number of groups and subjects / group
if (do.rate == 0) {
x <- n
stop.txt <- paste("The sample size of", x)
}else {
stop.txt <- paste("The adjusted sample size of", x)
}
if (x <= capacity) {
stop(stop.txt, " does not exhaust the clinical capacity.",
call. = FALSE)
}else {
# split sample size into >=2 groups based on capacity
# TODO: Handle a case where with no dropouts <= capacity (no grouping)
#       but with the adjusted sample size two groups are required
if (equal) {# attempt to make all groups equally sized
grps <- ceiling(n / capacity)
tmp  <- rep(n / grps, grps)
ngrp <- make.equal(tmp, ns)
if (!isTRUE(all.equal(tmp, ngrp)))
message("Note: Imbalanced sequences in groups (",
paste(round(tmp, 0), collapse = "|"), ") corrected.\n")
if (sum(ngrp) > n) ngrp[length(ngrp)] <- n-sum(ngrp[-1])
}else {    # at least one group = capacity
ngrp    <- rep(0, ceiling(n / capacity))
grps    <- length(ngrp)
ngrp[1] <- capacity
for (j in 2:grps) {
n.tot <- sum(ngrp) # what we have so far
if (n.tot + capacity <= n) {
ngrp[j] <- capacity
}else {
ngrp[j] <- n - n.tot
}
}
}
}
return(ngrp = list(grps = length(ngrp), ngrp = ngrp))
}
if (equal) {
txt <- "Attempting to generate equally sized groups."
}else {
txt <- paste("Attempting to have at least one group with",
"\nthe maximum capacity of the clinical site.")
}
txt    <- paste(txt, "\nCV                   :", sprintf("%.4f", CV),
"\ntheta0               :", sprintf("%.4f", theta0),
"\nBE-limits            :",
sprintf("%.4f – %.4f", theta1, theta2),
"\nTarget power         :", sprintf("%.2f", target),
"\nDesign               :", design,
"\nClinical capacity    :", capacity)
tmp    <- sampleN.TOST(CV = CV, theta0 = theta0, theta1 = theta1,
theta2 = theta2, targetpower = target,
design = design, print = FALSE)
res    <- data.frame(n = NA_integer_, grps = NA_integer_,
n.grp = NA_integer_, m.1 = NA_real_, m.2 = NA_real_,
change = NA_real_)
res$n <- tmp[["Sample size"]] res[4] <- tmp[["Achieved power"]] x <- grouping(capacity, res$n, design, equal, do.rate)
res[2] <- x[["grps"]]
ngrp   <- x[["ngrp"]]
res[3] <- paste(ngrp, collapse = " | ")
res[5] <- power.TOST.sds(CV = CV, theta0 = theta0, theta1 = theta1,
theta2 = theta2, n = res$n, design = design, grps = res$grps, ngrp = ngrp,
gmodel = gmodel, progress = FALSE, nsims = nsims)
res[6] <- 100 * (res$m.2 - res$m.1) / res$m.1 des <- known.designs()[2:3] df <- as.expression(des$df[des$design == design]) dfs <- rep(NA_integer_, 2) n <- res$n
dfs[1] <- as.integer(eval(parse(text = df)))
dfs[2] <- as.integer(dfs[1] - (res$grps - 1)) txt <- paste(txt, paste0("\n", paste(rep("—", 46), collapse = "")), "\nTotal sample size :", res$n,
"\nNumber of groups     :", sprintf("%2.0f", res[2]),
"\nSubjects per group   :", res[3],
"\nDegrees of freedom   :",
sprintf("%3i", dfs[1]), "(pooled model 4)",
"\n                      ",
sprintf("%3i", dfs[2]), "(group model 5)")
if (do.rate > 0) {
res[2, 1] <- nadj(n, do.rate, ns)
res[2, 4] <- signif(power.TOST(CV = CV, theta0 = theta0, theta1 = theta1,
theta2 = theta2, n = res$n[2], design = design), 4) x <- grouping(capacity, res$n[2], design, equal, do.rate)
res[2, 2] <- x[["grps"]]
ngrp      <- x[["ngrp"]]
res[2, 3] <- paste(ngrp, collapse = " | ")
res[2, 5] <- power.TOST.sds(CV = CV, theta0 = theta0, theta1 = theta1,
theta2 = theta2, n = res$n[2], design = design, grps = res$grps, ngrp = ngrp,
gmodel = gmodel, progress = FALSE, nsims = nsims)
res[2, 6] <- 100 * (res$m.2[2] - res$m.1[2]) / res$m.1[2] dfs <- rep(NA_integer_, 2) n <- res$n[2]
dfs[1]    <- as.integer(eval(parse(text = df)))
dfs[2]    <- as.integer(dfs[1] - (res$grps[2] - 1)) txt <- paste(txt, paste0("\n", paste(rep("—", 46), collapse = "")), "\nAnticip. dropout-rate:", sprintf("%2g%%", 100 * do.rate), "\nAdjusted sample size :", res$n[2],
"\nNumber of groups     :", sprintf("%2.0f", res[2, 2]),
"\nSubjects per group   :", res[2, 3],
"\nDegrees of freedom   :",
sprintf("%3i", dfs[1]), "(pooled model 4)",
"\n                      ",
sprintf("%3i", dfs[2]), "(group model 5)")
}
res[, 4] <- sprintf("%6.4f", res[, 4])
res[, 5] <- sprintf("%6.4f", res[, 5])
res[, 6] <- sprintf("%+.3f%%", res[, 6])
names(res)[4:6] <- c("pooled model", "group model", "change")
txt    <- paste0(txt, paste0("\n", paste(rep("—", 46), collapse = "")),
"\nAchieved power of the pooled and group models;",
"\nrelative change in power of the group model",
"\ncompared to the pooled model:\n\n")
if (show) cat(txt)
if (do.rate == 0) {
print(res[, c(1, 4:6)], row.names = FALSE)
}else {
print(res[, c(1, 4:6)])
}
return(result = list(power = res, df = dfs))
}

## Conventional ABE

Say, the assumed CV is 30%, the T/R-ratio 0.95, and we plan the study for ≥ 80% power in a 2×2×2 crossover design. The capacity of the clinical site is 24. We anticipate a dropout-rate of 5% and adjust the sample size accordingly (i.e., dose more subjects in order to have at least as many eligible subjects than estimated for the target power). We want to have at least one group with the capacity of the site.

CV       <- 0.30
theta0   <- 0.95
target   <- 0.80
design   <- "2x2x2"
capacity <- 24
do.rate  <- 0.05
x <- sim.groups(CV = CV, theta0 = theta0, target = target,
design = design, capacity = capacity, do.rate = do.rate)
# Attempting to have at least one group with
# the maximum capacity of the clinical site.
# CV                   : 0.3000
# theta0               : 0.9500
# BE-limits            : 0.8000 – 1.2500
# Target power         : 0.80
# Design               : 2x2x2
# Clinical capacity    : 24
# ——————————————————————————————————————————————
# Total sample size    : 40
# Number of groups     :  2
# Subjects per group   : 24 | 16
# Degrees of freedom   :  38 (pooled model 4)
#                         37 (group model 5)
# ——————————————————————————————————————————————
# Anticip. dropout-rate:  5%
# Adjusted sample size : 44
# Number of groups     :  2
# Subjects per group   : 24 | 20
# Degrees of freedom   :  42 (pooled model 4)
#                         41 (group model 5)
# ——————————————————————————————————————————————
# Achieved power of the pooled and group models;
# relative change in power of the group model
# compared to the pooled model:
#
#           n pooled model group model  change
# Expected 40       0.8158      0.8120 -0.466%
# Adjusted 44       0.8508      0.8498 -0.116%

In the group model $$\small{(5)}$$ we loose only one degree of freedom compared to the pooled model $$\small{(4)}$$. Hence, the loss in power is negligble.

Let us try a four-period full replicate design. Since the periods double, we anticipate a higher dropout-rate of 10%.

CV       <- 0.30
theta0   <- 0.95
target   <- 0.80
design   <- "2x2x4"
do.rate  <- 0.1
capacity <- 24
x <- sim.groups(CV = CV, theta0 = theta0, target = target,
design = design, do.rate = do.rate, capacity = capacity)
# Error: The adjusted sample size of 24 does not exhaust the clinical capacity.

Although we anticipated a higher dropout-rate, we could readily perform the study in a single group. Would there be problems with power?

make.equal <- function(n, ns) {
# make equally sized sequences
return(as.integer(ns * (n %/% ns + as.logical(n %% ns))))
}
nadj <- function(n, do.rate, ns) {
return(as.integer(make.equal(n / (1 - do.rate), ns)))
}
CV      <- 0.30
theta0  <- 0.95
target  <- 0.80
design  <- "2x2x4"
do.rate <- 0.1
ns      <- as.integer(substr(design, 3, 3))
x       <- sampleN.TOST(CV = CV, theta0 = theta0, targetpower = target,
design = design, print = FALSE, details = FALSE)[7:8]
elig    <- nadj(x[["Sample size"]], do.rate, ns):x[["Sample size"]]
x       <- data.frame(eligible = elig, power = NA_real_)
for (i in seq_along(elig)) {
x$power[i] <- suppressMessages( power.TOST(CV = CV, theta0 = theta0, n = elig[i], design = design)) } print(signif(x, 4), row.names = FALSE) # eligible power # 24 0.8819 # 23 0.8682 # 22 0.8543 # 21 0.8374 # 20 0.8202 Not at all. Furthermore, in the full replicate design performed in a single group we have more degrees of freedom than in the 2×2×2 crossover performed in two groups (see the table above). CV <- 0.335 var <- CV2mse(CV) n.cross <- c(44, 40) n.repl <- c(24, 20) cross <- data.frame(design = rep("2x2x2", 2), groups = rep(2, 2), n = n.cross, df = n.cross - 2 - (2 - 1), CL.lo = NA_real_, CL.hi = NA_real_) repl <- data.frame(design = rep("2x2x4", 2), groups = rep(1, 2), n = n.repl, df = 3 * n.repl - 4, CL.lo = NA_real_, CL.hi = NA_real_) for (i in 1:2) { cross[i, 5:6] <- exp(log(theta0) - log(1) + c(-1, +1) * qt(1 - 0.05, df = cross$df[i]) *
sqrt(var / 2 * (1 / n.cross[1] + 1 / n.cross[2])))
repl[i, 5:6]  <- exp(log(theta0) - log(1) + c(-1, +1) *
qt(1 - 0.05, df = repl$df[i]) * sqrt(var / 4 * (1 / n.repl[1] + 1 / n.repl[2]))) } cross$width   <- cross$CL.hi - cross$CL.lo
repl$width <- repl$CL.hi - repl$CL.lo diffs <- cross$width - repl$width x <- rbind(repl, cross) names(x)[2:3] <- c("group(s)", "N") x[, 2] <- sprintf("%.0f ", x[, 2]) x[, 5] <- sprintf("%.3f%%", 100 * x[, 5]) x[, 6] <- sprintf("%.3f%%", 100 * x[, 6]) x[, 7] <- sprintf("%.3f%%", 100 * x[, 7]) print(x, row.names = FALSE) cat("Maximum difference in the width of CIs:", sprintf("%.2g%%.\n",100 * max(diffs))) # design group(s) N df CL.lo CL.hi width # 2x2x4 1 24 68 87.492% 103.152% 15.660% # 2x2x4 1 20 56 87.471% 103.177% 15.707% # 2x2x2 2 44 41 87.277% 103.406% 16.128% # 2x2x2 2 40 37 87.259% 103.428% 16.169% # Maximum difference in the width of CIs: 0.47%. Hence, the CIs in the replicate design will be ≈ 0.5% narrower than in the 2×2×2 crossover. ## NTID Say, we have to deal with a narrow therapeutic index drug (BE-limits 90.00 – 111.11%) with low variability of 7%. We assume a T/R-ratio of 0.975 and target ≥ 80% power in a 2×2×2 crossover design. Furthermore, we anti­cipate a dropout-rate of 5%. Due to potential AEs close monitoring of subjects is required and the clinical site is equipped with eight fully monitored beds. By specifying equal = TRUE the script attempts to generate equally sized groups. CV <- 0.07 theta0 <- 0.975 theta1 <- 0.90 target <- 0.80 design <- "2x2x2" do.rate <- 0.05 capacity <- 8 x <- sim.groups(CV = CV, theta0 = theta0, theta1 = theta1, target = target, design = design, do.rate = do.rate, capacity = capacity, equal = TRUE) # Note: Imbalanced sequences in groups (7|7) corrected. # Attempting to generate equally sized groups. # CV : 0.0700 # theta0 : 0.9750 # BE-limits : 0.9000 – 1.1111 # Target power : 0.80 # Design : 2x2x2 # Clinical capacity : 8 # —————————————————————————————————————————————— # Total sample size : 12 # Number of groups : 2 # Subjects per group : 6 | 6 # Degrees of freedom : 10 (pooled model 4) # 9 (group model 5) # —————————————————————————————————————————————— # Anticip. dropout-rate: 5% # Adjusted sample size : 14 # Number of groups : 2 # Subjects per group : 8 | 6 # Degrees of freedom : 12 (pooled model 4) # 11 (group model 5) # —————————————————————————————————————————————— # Achieved power of the pooled and group models; # relative change in power of the group model # compared to the pooled model: # # n pooled model group model change # Expected 12 0.8274 0.8218 -0.677% # Adjusted 14 0.8849 0.8824 -0.283% Two equally sized groups of six subjects would be possible with the estimated sample size. That’s what we requested. However, for the adjusted sample size of 14 subjects, groups of seven subjects would be imbalanced.5 Hence, a note is issued and sequences are corrected in such a way that both groups are balanced. As usual, the loss in power by the group model $$\small{(5)}$$ compared to the pooled model $$\small{(4)}$$ is less than one percent and thus, negligble. ## Type I Error We can assess the Type I Error by setting $$\small{\theta_0}$$ to one of the BE limits $$\small{\left\{\theta_1,\theta_2\right\}}$$, i.e., assume that the null hypothesis is true. This can be done either by simulations with the function power.TOST.sds() or with the function power.TOST() after adjusting the degrees of freedom as outlined above. We need at least 106 simulations in order to obtain a stable result. # Cave: long runtime CV <- 0.30 design <- "2x2x2" theta1 <- 0.80 theta2 <- 1.25 ngrp <- c(22L, 18L) grps <- 2 n <- sum(ngrp) gmodel <- 3:2 nsims <- 1e6 adj <- c(n, n - (grps - 1)) x <- data.frame(model = c("pooled", "group"), df = c(n - 2, n - 2 - (grps - 1)), simulated = NA_real_, exact = NA_real_) for (i in seq_along(gmodel)) { x$simulated[i] <- power.TOST.sds(CV = CV, theta0 = theta2, n = n,
design = design, grps = grps,
ngrp = ngrp, gmodel = gmodel[i],
nsims = nsims, progress = FALSE)
xexact[i] <- suppressMessages( power.TOST(CV = CV, theta0 = theta2, n = adj[i], design = design)) } cat("CV :", sprintf("%.4f", CV), "\nBE-limits :", sprintf("%.4f – %.4f", theta1, theta2), "\nDesign :", design, "\nTotal sample size :", n, "\nNumber of groups :", sprintf("%2.0f", grps), "\nSubjects per group:", paste(ngrp, collapse = " | "), "\nNull assessed at :", sprintf("%.4f\n\n", theta2)) print(x, row.names = FALSE, right = FALSE) # CV : 0.3000 # BE-limits : 0.8000 – 1.2500 # Design : 2x2x2 # Total sample size : 40 # Number of groups : 2 # Subjects per group: 22 | 18 # Null assessed at : 1.2500 # # model df simulated exact # pooled 38 0.049801 0.04999975 # group 37 0.049791 0.04999961 Of course, in the pooled model $$\small{(4)}$$ the Type I Error is controlled. Due to less degrees of freedom in the group model $$\small{(5)}$$ the Type I Error is even lower. # Summary Coming back to the question asked in the Introduction. To repeat: What is the best approach to deal with the limited capacity of a clinical site? • If ever possible find a clinical site which can accomodate all subjects at once. • If that is not possible, consider multiple groups (single site) or multiple sites. The latter might be unavoidable for studies in patients where recruitment is difficult. • Whether the statistical model has to be modified is still an open issue. The Russian authority – and recently some Euro­pean agencies – prefer for multi-group studies (5). For multi-site studies I recommend (5). As shown, the loss in power compared to (4) is practically negligble. The chosen model must be unambigously stated in the protocol. • Due to its smaller sample size a replicate design might be an alternative avoiding multiple groups. • It gives a slight incentive in terms of power, since the confidence interval will be narrower than the one of a 2×2×2 crossover due to the larger degrees of freedom. • If a subject drops out from the study after the second period, the data can still be used (in most6 designs at least one observation of T and R is available). • Even if the study was not planned for scaled average bioequivalence (ABEL or RSABE), the additional information about the within-subject variability of R (and in full replicated designs additionally of T) is a convenient side effect. It avoids the problems outlined in another article. A member of the EMA’s PKWP once told me that he would like to see all studies performed in a replicate design – regardless whether the drug / drug product is highly variable or not. From a purely statistical perspective, he is right and it is one of the rare cases where we were of the same opinion. Power (and hence, the sample size) depends on the number of treatments – the smaller sample size in replicate designs is compensated by more administrations. If the total sample size of a 2×2×2 crossover design is $$\small{N}$$, then the sample size of a four-period replicate design is $$\small{\tfrac{1}{2}N}$$ and of a three-period replicate design $$\small{\tfrac{3}{4}N}$$. Costs of a replicate design are similar to the common 2×2×2 crossover design (for details see another article). Nevertheless, smaller sample sizes may come with a price. We will save costs due to less pre-/post-study exams but have to pay a higher subject remuneration (more hospitalizations and blood draws). However, we have the same number of samples to analyze and study costs are driven to a good part by bioanalytics.7 Furthermore, one must be aware that more periods / washout phases increase the chance of dropouts. previous section ↩︎ # Postscript Testing for a significant group- (or site-) by-treatment interaction should be avoided. In a nutshell: Model $$\small{(5)}$$ is expanded with an interaction term ($$\small{F\times treatment}$$) to \eqalign{response&|\;F,\,\;sequence,\,subject(F\times sequence),\\ &\phantom{|}\;period(F),F\times sequence,\,treatment,\,F\times treatment}\tag{9} • Test for a group- (or site-) by-treatment interaction at α = 0.1.8 • If α ≥ 0.1, evaluate the study by the group model (5). • If α < 0.1, pooling is not allowed. Evaluate one of the groups – generally the largest9 – by (4). That leads to a substantial loss in power.10 • As any pre-test it will result in an inflated Type I Error (increased patient’s risk).11 12 13 Not rocket science but basic statistics… • If certain conditions are fulfilled, the study can be evaluated by the conventional model (4).11 Regrettably the WHO stated also:14 In those cases where the subjects are recruited and treated in groups, it is appropriate to investigate the statistical significance of the group-by-formulation interaction e.g., with the following ANOVA model: Group, Sequence, Formulation, Period (nested within Group), Group-by-Sequence interaction, Subject (nested within Group*Sequence) and Group-by-Formulation interaction. If this interaction is significant, the study results are not interpretable. However, it is not considered to be correct to report the results of the 90% confidence interval of the ratio test/comparator based on the standard error derived from this ANOVA. If the group-by-formulation interaction is not significant, the 90% confidence interval should be calculated based on the ANOVA model defined in the protocol. This model may or may not include the group effect as pre-defined in the protocol. This depends on whether the group effect is believed to explain the variability observed in the data. WHO (2020)14 (my emphases) On the contrary: It is not appropriate! At least we can state in the protocol that we don’t believe [sic]15 in a group-by-treatment interaction. Believes don’t belong to the realm of science – only assumptions do. The problems arising with $$\small{(9)}$$ in terms of power and inflation of the Type I Error is elaborated in another article. When I gave a presentation16 about groups/sites in bioequivalence in front of European regulatory (‼) statisticians, they were asking in disbelieve »What the heck are you talking about?« The European Guideline17 states: The study should be designed in such a way that the formulation effect can be distinguished from other effects. The precise model to be used for the analysis should be pre-specified in the protocol. The statistical analysis should take into account sources of variation that can be reasonably assumed to have an effect on the response variable. EMA (2010)17 (my emphasis) Is it reasonable to assume that groups have an effect on the PK response (i.e., a true group-by-treatment interaction)? IMHO, not at all. Recall that we have the same inclusion- & exclusion-criteria and hence, similar demographics, the same clinical site, and analyze samples with the same bioanalytical method. In a crossover study each subject acts as its own control, right? If we follow the approach outlined above, we will face a false positive rate at the level of the test $$\small{\alpha}$$ and will not be allowed to use $$\small{(4)}$$ or $$\small{(5)}$$ of pooled data in 10% of cases. If that happens, the loss in power will be tremendious. Returning to the example from above: design <- "2x2x2" theta0 <- 0.95 CV <- 0.30 grps <- 2 ngrp <- c(24, 16) res <- data.frame(model = c("pooled", "groups", "large group"), n = c(rep(sum(ngrp), 2), ngrp[1]), grps = c(rep(grps, 2), 1), n.grp = c(rep(paste(ngrp, collapse = "|"), 2), ngrp[1]), power = NA_real_, loss = NA_real_) respower[1] <- power.TOST(CV = CV, theta0 = theta0,
n = res$n[1], design = design) res$power[2] <- power.TOST.sds(CV = CV, theta0 = theta0, n = res$n[2], design = design, grps = grps, ngrp = ngrp, gmodel = 2, progress = FALSE) res$power[3] <- power.TOST(CV = CV, theta0 = theta0,
n = ngrp[1], design = design)
res$loss[2:3] <- sprintf("%4.1f%%", 100 * abs((res$power[2:3] - res$power[1]) / res$power[1]))
res$loss[is.na(res$loss)] <- "-  "
res[, 3]      <- sprintf("%.0f    ", res[, 3])
res[1]        <- c("pooled (4)", "groups (5)", "large group only (4)")
res$power <- signif(res$power, 4)
names(res)[c(2:4)] <- c("N", "group(s)", "n/group")
print(res, row.names = FALSE)
#                 model  N group(s) n/group  power  loss
#            pooled (4) 40    2       24|16 0.8158   -
#            groups (5) 40    2       24|16 0.8120  0.5%
#  large group only (4) 24    1          24 0.5577 31.6%

Given, that would happen only in 10% of studies but if it does, power will be hardly larger than tossing a coin. Ex­ten­sive simulations of 2×2×2 crossover studies showed that the overall loss in power can reach ≈ 11%.

Hundreds (or thousands‽) of studies have been performed in multiple groups, evaluated by $$\small{(4)}$$ and were accepted by European agencies in the twinkling of an eye.

For the ugly details see another article.

previous section ↩︎

Helmut Schütz 2022
R and PowerTOST GPL 3.0, pandoc GPL 2.0.
1st version June 28, 2022. Rendered August 19, 2022 19:09 CEST by rmarkdown via pandoc in 0.59 seconds.

Footnotes and References

1. Labes D, Schütz H, Lang B. PowerTOST: Power and Sample Size for (Bio)Equivalence Studies. Package version 1.5.4. 2022-02-21. CRAN.↩︎

2. FDA, CDER. Guidance for Industry. Statistical Approaches to Establishing Bioequivalence. Rockville. January 2001. Download.↩︎

3. Statisticians may be more familiar with the terminology $$\small{\textrm{AB}|\textrm{BA}}$$.
Forgive me – in bioequivalence $$\small{\textrm{TR}|\textrm{RT}}$$ is commonly used, where $$\small{\textrm{T}}$$ denotes the Test and $$\small{\textrm{R}}$$ the Reference treatment (formulation).↩︎

4. If Subject 1 is randomized to sequence $$\small{\text{TR}}$$, there is not ‘another’ Subject 1 randomized to sequence $$\small{\text{RT}}$$. Randomization is not like Schrödinger’s cat. Hence, the nested term in the guidelines is an insult to the mind.↩︎

5. Not the same number of subjects in sequence $$\small{\textrm{TR}}$$ than in sequence $$\small{\textrm{RT}}$$.↩︎

6. Of course, not in the $$\small{\textrm{TTRR|RRTR}}$$ full replicate design. However, it is rarely employed in practice.↩︎

7. In case of a ‘poor’ bioanalytical method requiring a large sample volume: Since the total blood sampling volume is generally limited with the one of a blood donation, one may opt for a three-period full replicate or has to measure HCT prior to administration in higher periods and – for safety reasons – exclude subjects if their HCT is too high.↩︎

8. Since this is a between-subject comparison, it has low power. IMHO, the 10% level is arbitrary (I failed to find any justification).↩︎

9. What if more than one group have the same size? Has BE to be demonstrated for all of them? Quite possible, since that would be the most conservative approach. However, I failed to find a reference. That’s one of the reasons I prefer to have at least one group in the maximum capacity of the site instead of equally sized groups.↩︎

10. If you have – really! – nothing better to do (long runtime): Try the examples with the additional argument gmodel = 1 in the function. In the first example 5.3–5.7% power will be lost and in the second 3.6–6.0%.↩︎

11. Schütz H. Multi-Group Studies in Bioequivalence. To pool or not to pool? Presentation at: BioBridges 2018. Prague. 26–27 Sep­tem­ber 2018. Online.↩︎

12. For the examples I got in 106 simulations (extremely long runtime) an inflated Type I Error of ≈ 0.063.↩︎

13. In a meeting in April 2021 the FDA accepted my concerns about an inflated Tpye I Error and agreed that in a planned multi-center study model $$\small{(5)}$$ without a pre-test should be used.↩︎

14. WHO. Frequent deficiencies in BE study protocols. Geneva. November 2020. Online.↩︎

15. Quoting my late father: »If you believe, go to church.«↩︎

16. Schütz H. Multi-Group and Multi-Site Studies. To pool or not to pool? 2nd Annual Biosimilars Forum. Satellite Short Course. Budapest. 5 October 2017. Online.↩︎

17. EMA, CHMP. Guideline on the Investigation of Bioequivalence. CPMP/EWP/QWP/1401/98 Rev. 1/ Corr ** London. 20 January 2010. Online↩︎