Consider allowing JavaScript. Otherwise, you have to be proficient in reading since formulas will not be rendered. Furthermore, the table of contents in the left column for navigation will not be available and code-folding not supported. Sorry for the inconvenience.
Examples in this article were generated with
4.4.1
by the package PowerTOST
.1
Abbreviation | Meaning |
---|---|
\(\small{\alpha}\) | Level of the test |
BE | Bioequivalence |
CI | Confidence Interval |
\(\small{CV}\) |
Within-subject Coefficient of Variation (paired, crossover, replicate
designs), total CV (parallel design) |
\(\small{\Delta}\) | Clinically relevant difference |
\(\small{H_0}\) | Null hypothesis |
\(\small{H_1}\) | Alternative hypothesis (also \(\small{H_\textrm{a}}\)) |
\(\small{\theta_0=\mu_\textrm{T}/\mu_\textrm{R}}\) | True T/R-ratio |
\(\small{\theta_1,\;\theta_2}\) | Lower and upper limits of the acceptance range (BE margins) |
What is the best approach to deal with the limited capacity of a clinical site?
The most simple – and preferable – approach is to find a clinical site which is able to accommodate all subjects at once. If this is not possible, subjects could be allocated to multiple groups (aka cohorts) or sites. Whether or not a group- (site-) term should by included in the statistical model is still the topic of heated debates. In the case of multi-site studies regulators require a modification of the model.2 3 4
Since in replicate designs less subjects are required to achieve the same power than in conventional 2×2×2 crossover design, sometimes multi-group or multi-site studies can be avoided.
A basic knowledge of R is
required. To run the scripts at least version 1.4.8 (2019-08-29) of
PowerTOST
is suggested. Any version of R would likely do, though the current release of
PowerTOST
was only tested with version 4.3.3 (2024-02-29)
and later.
All scripts were run on a Xeon E3-1245v3 @ 3.40GHz (1/4 cores) 16GB RAM
with R 4.4.1 on Windows 7 build 7601, Service
Pack 1, Universal C Runtime 10.0.10240.16390.
The examples deal primarily with the 2×2×2 crossover design (\(\small{\textrm{TR}|\textrm{RT}}\))5 but the concept is applicable to any kind of crossover (Higher-Order, replicate designs) or parallel designs assessed for equivalence.
In the following I consider groups, i.e., studies performed in a single site. However, the concept is applicable for studies performed in multiple sites as well. Studies in groups in multiple sites are out of scope of this article.
The lower and upper limits of the bioequivalence (BE) range \(\small{\{\theta_1,\theta_2\}}\) are calculated based on the ‘clinically not relevant difference’ \(\small{\Delta}\) \[\left\{\theta_1=100\,(1-\Delta),\,\theta_2=100\,(1-\Delta)^{-1}\right\}\tag{1}\] Commonly \(\small{\Delta}\) is set to 0.20. For narrow therapeutic index drugs (the EMA and other jurisdictions) \(\small{\Delta}\) is set to 0.10, and for \(\small{C_\textrm{max}}\) (Russian Federation, the EEU, South Africa) \(\small{\Delta}\) is set to 0.25. We obtain: \[\begin{matrix}\tag{2} \left\{\theta_1=80.00\%,\,\theta_2=125.00\%\right\}\\ \left\{\theta_1=90.00\%,\,\theta_2=111.1\dot{1}\%\right\}\\ \left\{\theta_1=75.00\%,\,\theta_2=133.3\dot{3}\%\right\} \end{matrix}\]
Conventionally BE is assessed by the confidence interval inclusion approach: \[H_0:\frac{\mu_\textrm{T}}{\mu_\textrm{R}}\not\subset\left\{\theta_1,\theta_2\right\}\:vs\:H_1:\theta_1<\frac{\mu_\textrm{T}}{\mu_\textrm{R}}<\theta_2\tag{3}\]
As long as the \(\small{100\,(1-2\,\alpha)}\) confidence interval lies entirely within pre-specified limits \(\small{\{\theta_1,\theta_2\}}\), the null hypothesis of inequivalence in \(\small{(3)}\) is rejected and the alternative hypothesis of equivalence in \(\small{(3)}\) is accepted.
The model for a crossover design in average bioequivalence as stated
in guidelines is \[\eqalign{response&|\;sequence,\,subject(sequence),\\
&\phantom{|}\;period,\,treatment}\tag{4}\] Most
regulations recommend an
ANOVA, i.e., all
effects fixed. The
FDA and Health
Canada recommend a mixed-effects model, i.e., to specify \(\small{subject(sequence)}\) as a random
effect and all others fixed.
It must be mentioned that in comparative bioavailability studies
subjects are usually uniquely coded. Hence, the nested term \(\small{subject(sequence)}\) in \(\small{(4)}\) is a
bogus one6 and could be replaced by the simple \(\small{subject}\) as well. See also another
article.
If the study is performed in multiple groups or sites, model \(\small{(4)}\) can be modified to \[\eqalign{response&|\;F,\,\;sequence,\,subject(F\times sequence),\\ &\phantom{|}\;period(F),F\times sequence,\,treatment\small{\textsf{,}}}\tag{5}\] where \(\small{F}\) is the code for \(\small{group}\) or \(\small{site}\), respectively.
When \(\small{N_\text{F}}\) is the
number of groups or sites, in \(\small{(5)}\) there are \(\small{N_\text{F}-1}\) less residual
degrees of freedom than in \(\small{(4)}\). The table below gives
the designs implemented in PowerTOST
.
\[\small{\begin{array}{l|lccccrr}
\phantom{00000000}\textsf{design} & \phantom{00}\textsf{code} &
\textsf{t} & \textsf{s} & \textsf{p} & \textsf{m} &
(4)\phantom{0} & (5)\phantom{00000000}\\\hline
\textsf{Parallel} & \textsf{parallel} & 2 & - & 1 &
1 & N-2 & N-2-(N_\text{F}-1)\\
\textsf{Paired means} & \textsf{paired} & 2 & - & 2
& 2 & N-1 & N-1-(N_\text{F}-1)\\
\textsf{Crossover} & \textsf{2}\times\textsf{2}\times\textsf{2}
& 2 & 2 & 2 & \tfrac{1}{2} & N-2 &
N-2-(N_\text{F}-1)\\
\textsf{2-sequence 3-period full replicate} &
\textsf{2}\times\textsf{2}\times\textsf{3} & 2 & 2 & 3 &
\tfrac{3}{8} & 2N-3 & 2N-3-(N_\text{F}-1)\\
\textsf{2-sequence 4-period full replicate} &
\textsf{2}\times\textsf{2}\times\textsf{4} & 2 & 2 & 4 &
\tfrac{1}{4} & 3N-4 & 3N-4-(N_\text{F}-1)\\
\textsf{4-sequence 4-period full replicate} &
\textsf{2}\times\textsf{4}\times\textsf{4} & 2 & 4 & 4 &
\tfrac{1}{16} & 3N-4 & 3N-4-(N_\text{F}-1)\\
\textsf{Partial replicate} &
\textsf{2}\times\textsf{3}\times\textsf{3} & 2 & 3 & 3 &
\tfrac{1}{6} & 2N-2 & 2N-3-(N_\text{F}-1)\\
\textsf{Balaam}'\textsf{s} &
\textsf{2}\times\textsf{4}\times\textsf{2} & 2 & 4 & 2 &
\tfrac{1}{2} & N-2 & N-2-(N_\text{F}-1)\\\textsf{Latin Square or
Williams}' & \textsf{3}\times\textsf{3} & 3 & 3 & 3
& \tfrac{2}{9} & 2N-4 & 2N-4-(N_\text{F}-1)\\
\textsf{Williams}' & \textsf{3}\times\textsf{6}\times\textsf{3}
& 3 & 6 & 3 & \tfrac{1}{18} & 2N-4 &
2N-4-(N_\text{F}-1)\\
\textsf{Latin Square or Williams}' & \textsf{4}\times\textsf{4}
& 4 & 4 & 4 & \tfrac{1}{8} & 3N-6 &
3N-6-(N_\text{F}-1)\\
\end{array}}\] \(\small{\textsf{code}}\) is the
design
-argument in the functions of PowerTOST
.
\(\small{\textsf{t, s, p}}\) are the
number of treatments, sequences, and periods, respectively. \(\small{\textsf{m}}\) is the multiplier in
the radix of \(\small{(6)}\) and \(\small{N}\) is the total number of
subjects, i.e., \(\small{N=\sum_{i=1}^{i=s}n_i}\).
The back-transformed \(\small{1-2\,\alpha}\) Confidence Interval (CI) is calculated by \[CI=100\,\exp\left(\log_{e}\overline{x}_\text{T}-\log_{e}\overline{x}_\text{R}\mp t_{df,\alpha}\sqrt{m \times\widehat{s^2}\sum_{i=1}^{i=s}\frac{1}{n_i}}\right)\small{\textsf{,}}\tag{6}\] where \(\widehat{s^2}\) is the residual variance of the model and \(\small{n_i}\) is the number of subjects in the \(\small{i^\text{th}}\) of \(\small{s}\) sequences.
library(PowerTOST) # attach it to run the examples
In a 2×2×2 crossover design the residual
degrees of freedom in the models are \[\eqalign{(4):df&=N-2\\
(5):df&=N-2-(N_\text{F}-1)\small{\textsf{,}}}\tag{7}\]
where \(\small{N}\) is total number of
subjects. Therefore, the CI by
\(\small{(6)}\)
for given \(\small{N}\) and \(\small{\widehat{s^2}}\) by \(\small{(5)}\) will
always be wider (more conservative) than the one by \(\small{(4)}\) due
to the fewer degrees of freedom and hence, larger \(\small{t}\)-value.
However, unless the sample size is not small and the number of groups
(or sites) large, the difference in widths of the
CIs is generally
neglegible.
<- "2x2x2"
design <- 0.95
theta0 <- 0.335
CV <- CV2mse(CV)
var <- sampleN.TOST(CV = CV, theta0 = theta0, design = design,
n print = FALSE)[["Sample size"]]
<- n2 <- n / 2
n1 <- c(2:4, 6, 8)
NF <- data.frame(n = n, NF = NF,
x df.4 = n - 2,
CL.lo.4 = NA_real_, CL.hi.4 = NA_real_,
width.4 = NA_real_,
df.5 = n - 2 - (NF - 1),
CL.lo.5 = NA_real_, CL.hi.5 = NA_real_,
width.5 = NA_real_)
for (i in seq_along(NF)) {
4:5] <- exp(log(theta0) - log(1) + c(-1, +1) *
x[i, qt(1 - 0.05, df = x$df.4[i]) *
sqrt(var / 2 * (1 / n1 + 1 / n2)))
8:9] <- exp(log(theta0) - log(1) + c(-1, +1) *
x[i, qt(1 - 0.05, df = x$df.5[i]) *
sqrt(var / 2 * (1 / n1 + 1 / n2)))
}$width.4 <- x$CL.hi.4 - x$CL.lo.4
x$width.5 <- x$CL.hi.5 - x$CL.lo.5
x<- x$width.5 - x$width.4
diffs names(x)[c(1, 3:5, 7:9)] <- c("N", "df (4)", "CL.lo (4)", "CL.hi (4)",
"df (5)", "CL.lo (5)", "CL.hi (5)")
4] <- sprintf("%.3f%%", 100 * x[, 4])
x[, 5] <- sprintf("%.3f%%", 100 * x[, 5])
x[, 8] <- sprintf("%.3f%%", 100 * x[, 8])
x[, 9] <- sprintf("%.3f%%", 100 * x[, 9])
x[, print(x[, c(1:5, 7:9)], row.names = FALSE)
cat("Maximum difference in the width of CIs:",
sprintf("%.2g%%.\n", 100 * max(diffs)))
# N NF df (4) CL.lo (4) CL.hi (4) df (5) CL.lo (5) CL.hi (5)
# 48 2 46 84.955% 106.232% 45 84.951% 106.238%
# 48 3 46 84.955% 106.232% 44 84.946% 106.243%
# 48 4 46 84.955% 106.232% 43 84.942% 106.249%
# 48 6 46 84.955% 106.232% 41 84.932% 106.262%
# 48 8 46 84.955% 106.232% 39 84.920% 106.276%
# Maximum difference in the width of CIs: 0.079%.
If we have a study with 48 subjects in a 2×2×2 crossover design performed in two to eight groups (or sites), the difference in the widths of CIs is negligible in any case.
In a two-sequence four-period full replicate design the residual degrees of freedom in the models are \[\eqalign{(4):df&=3N-4\\ (5):df&=3N-4-(N_\text{F}-1)\small{\textsf{,}}}\tag{8}\]where \(\small{N}\) is total number of subjects. We need only half of the subjects because power depends on the number of treatments, which is the same like in the 2×2×2 crossover. Again, the CI by \(\small{(6)}\) for given \(\small{N}\) and \(\small{\widehat{s^2}}\) by \(\small{(5)}\) will always be wider than the one by \(\small{(4)}\).
<- "2x2x4"
design <- 0.95
theta0 <- 0.335
CV <- CV2mse(CV)
var <- sampleN.TOST(CV = CV, theta0 = theta0, design = design,
n print = FALSE)[["Sample size"]]
<- n2 <- n / 2
n1 <- c(2:4, 6)
NF <- data.frame(n = n, NF = NF,
x df.4 = 3 * n - 4,
CL.lo.4 = NA_real_, CL.hi.4 = NA_real_,
width.4 = NA_real_,
df.5 = 3 * n - 4 - (NF - 1),
CL.lo.5 = NA_real_, CL.hi.5 = NA_real_,
width.5 = NA_real_)
for (i in seq_along(NF)) {
4:5] <- exp(log(theta0) - log(1) + c(-1, +1) *
x[i, qt(1 - 0.05, df = x$df.4[i]) *
sqrt(var / 4 * (1 / n1 + 1 / n2)))
8:9] <- exp(log(theta0) - log(1) + c(-1, +1) *
x[i, qt(1 - 0.05, df = x$df.5[i]) *
sqrt(var / 4 * (1 / n1 + 1 / n2)))
}$width.4 <- x$CL.hi.4 - x$CL.lo.4
x$width.5 <- x$CL.hi.5 - x$CL.lo.5
x<- x$width.5 - x$width.4
diffs names(x)[c(1, 3:5, 7:9)] <- c("N", "df (4)", "CL.lo (4)", "CL.hi (4)",
"df (5)", "CL.lo (5)", "CL.hi (5)")
4] <- sprintf("%.3f%%", 100 * x[, 4])
x[, 5] <- sprintf("%.3f%%", 100 * x[, 5])
x[, 8] <- sprintf("%.3f%%", 100 * x[, 8])
x[, 9] <- sprintf("%.3f%%", 100 * x[, 9])
x[, print(x[, c(1:5, 7:9)], row.names = FALSE)
cat("Maximum difference in the width of CIs:",
sprintf("%.2g%%.\n",100 * max(diffs)))
# N NF df (4) CL.lo (4) CL.hi (4) df (5) CL.lo (5) CL.hi (5)
# 24 2 68 85.018% 106.154% 67 85.016% 106.156%
# 24 3 68 85.018% 106.154% 66 85.014% 106.159%
# 24 4 68 85.018% 106.154% 65 85.012% 106.161%
# 24 6 68 85.018% 106.154% 63 85.008% 106.167%
# Maximum difference in the width of CIs: 0.023%.
Here the difference in the widths of CIs is even smaller than in the 2×2×2 crossover due to the larger degrees of freedom.
The function power.TOST.sds()
of PowerTOST
supports simulations of \(\small{(5)}\) of studies in a 2×2×2
crossover, as well as in full and partial replicate designs.
The script supports sample size estimation for Average Bioequivalence
and generation of groups based on the clinical capacity, where sequences
within groups will be balanced. The estimated sample size can be
adjusted based on an anticipated dropout-rate. If this is selected
(do.rate > 0
) two simulations are performed: One for the
adjusted sample size (optimistic: no dropouts) and one for the estimated
sample size (pessimistic: dropout-rate realized).
Two options for the generation of groups are supported by the logical
argument equal
:
FALSE
: Attempts to generate at least one group
with the maximum size of the clinical site (default).TRUE
: Attempts to generate equally sized
groups.Last but not least power by \(\small{(5)}\) is compared to exact
power by \(\small{(4)}\).
As we have seen in the table above, due to its
lower degrees of freedom power of model \(\small{(5)}\) should always be lower
than the one of model \(\small{(4)}\). If you get a positive
change
value in the comparison, it is a simulation
artifact. In such a case, increase the number of simulations
(nsims = 1e6
or higher). Cave: 168
LOC.
<- function(CV, theta0 = 0.95, theta1, theta2, target = 0.80,
sim.groups design = "2x2x2", capacity, equal = FALSE,
gmodel = 2, do.rate = 0, nsims = 1e5L, show = TRUE) {
##########################################################
# Explore the impact on power of a group model com- #
# pared to the conventional model of pooled data via #
# simulations. #
# ------------------------------------------------------ #
# capacity: maximum capacity of the clinical site #
# equal: TRUE : tries to generate equally sized #
# groups #
# FALSE: tries to get at least one group with #
# the maximum size of the clinical site #
# (default) #
# do.rate: anticipated dropout-rate; if > 0 a second #
# simulation is performed based on the adjust- #
# ed sample size #
##########################################################
if (missing(theta1) & missing(theta2)) theta1 <- 0.80
if (missing(theta1)) theta1 <- 1 / theta2
if (missing(theta2)) theta2 <- 1 / theta1
if (theta0 < theta1 | theta0 > theta2)
stop("theta0 must be within {theta1, theta2}.", call. = FALSE)
if (theta0 == theta1 | theta0 == theta2)
stop("Simulation of the Type I Error not supported.",
"\n Use power.TOST.sds() directly.", call. = FALSE)
if (!design %in% c("2x2", "2x2x2", "2x3x3", "2x2x4", "2x2x3"))
stop("Design \"", design, "\" not implemented.", call. = FALSE)
<- as.integer(substr(design, 3, 3))
ns <- function(n, ns) {
make.equal # make equally sized sequences
return(as.integer(ns * (n %/% ns + as.logical(n %% ns))))
}<- function(n, do.rate, ns) {
nadj # adjust the sample size
return(as.integer(make.equal(n / (1 - do.rate), ns)))
}<- function(capacity, n, design, equal, do.rate) {
grouping # based on the sample size and capacity, calculate the
# number of groups and subjects / group
if (do.rate == 0) {
<- n
x <- paste("The sample size of", x)
stop.txt else {
}<- nadj(n, do.rate, ns)
x <- paste("The adjusted sample size of", x)
stop.txt
}if (x <= capacity) {
stop(stop.txt, " does not exhaust the clinical capacity.",
call. = FALSE)
else {
}# split sample size into >=2 groups based on capacity
# TODO: Handle a case where with no dropouts <= capacity (no grouping)
# but with the adjusted sample size two groups are required
if (equal) {# attempt to make all groups equally sized
<- ceiling(n / capacity)
grps <- rep(n / grps, grps)
tmp <- make.equal(tmp, ns)
ngrp if (!isTRUE(all.equal(tmp, ngrp)))
message("Note: Imbalanced sequences in groups (",
paste(round(tmp, 0), collapse = "|"), ") corrected.\n")
if (sum(ngrp) > n) ngrp[length(ngrp)] <- n-sum(ngrp[-1])
else { # at least one group = capacity
}<- rep(0, ceiling(n / capacity))
ngrp <- length(ngrp)
grps 1] <- capacity
ngrp[for (j in 2:grps) {
<- sum(ngrp) # what we have so far
n.tot if (n.tot + capacity <= n) {
<- capacity
ngrp[j] else {
}<- n - n.tot
ngrp[j]
}
}
}
}return(ngrp = list(grps = length(ngrp), ngrp = ngrp))
}if (equal) {
<- "Attempting to generate equally sized groups."
txt else {
}<- paste("Attempting to have at least one group with",
txt "\nthe maximum capacity of the clinical site.")
}<- paste(txt, "\nCV :", sprintf("%.4f", CV),
txt "\ntheta0 :", sprintf("%.4f", theta0),
"\nBE-limits :",
sprintf("%.4f – %.4f", theta1, theta2),
"\nTarget power :", sprintf("%.2f", target),
"\nDesign :", design,
"\nClinical capacity :", capacity)
<- sampleN.TOST(CV = CV, theta0 = theta0, theta1 = theta1,
tmp theta2 = theta2, targetpower = target,
design = design, print = FALSE)
<- data.frame(n = NA_integer_, grps = NA_integer_,
res n.grp = NA_integer_, m.1 = NA_real_, m.2 = NA_real_,
change = NA_real_)
$n <- tmp[["Sample size"]]
res4] <- tmp[["Achieved power"]]
res[<- grouping(capacity, res$n, design, equal, do.rate)
x 2] <- x[["grps"]]
res[<- x[["ngrp"]]
ngrp 3] <- paste(ngrp, collapse = " | ")
res[5] <- power.TOST.sds(CV = CV, theta0 = theta0, theta1 = theta1,
res[theta2 = theta2, n = res$n,
design = design, grps = res$grps, ngrp = ngrp,
gmodel = gmodel, progress = FALSE, nsims = nsims)
6] <- 100 * (res$m.2 - res$m.1) / res$m.1
res[<- known.designs()[2:3]
des <- as.expression(des$df[des$design == design])
df <- rep(NA_integer_, 2)
dfs <- res$n
n 1] <- as.integer(eval(parse(text = df)))
dfs[2] <- as.integer(dfs[1] - (res$grps - 1))
dfs[<- paste(txt, paste0("\n", paste(rep("—", 46), collapse = "")),
txt "\nTotal sample size :", res$n,
"\nNumber of groups :", sprintf("%2.0f", res[2]),
"\nSubjects per group :", res[3],
"\nDegrees of freedom :",
sprintf("%3i", dfs[1]), "(pooled model 4)",
"\n ",
sprintf("%3i", dfs[2]), "(group model 5)")
if (do.rate > 0) {
2, 1] <- nadj(n, do.rate, ns)
res[2, 4] <- signif(power.TOST(CV = CV, theta0 = theta0, theta1 = theta1,
res[theta2 = theta2, n = res$n[2],
design = design), 4)
<- grouping(capacity, res$n[2], design, equal, do.rate)
x 2, 2] <- x[["grps"]]
res[<- x[["ngrp"]]
ngrp 2, 3] <- paste(ngrp, collapse = " | ")
res[2, 5] <- power.TOST.sds(CV = CV, theta0 = theta0, theta1 = theta1,
res[theta2 = theta2, n = res$n[2],
design = design, grps = res$grps, ngrp = ngrp,
gmodel = gmodel, progress = FALSE, nsims = nsims)
2, 6] <- 100 * (res$m.2[2] - res$m.1[2]) / res$m.1[2]
res[<- rep(NA_integer_, 2)
dfs <- res$n[2]
n 1] <- as.integer(eval(parse(text = df)))
dfs[2] <- as.integer(dfs[1] - (res$grps[2] - 1))
dfs[<- paste(txt, paste0("\n", paste(rep("—", 46), collapse = "")),
txt "\nAnticip. dropout-rate:",
sprintf("%2g%%", 100 * do.rate),
"\nAdjusted sample size :", res$n[2],
"\nNumber of groups :", sprintf("%2.0f", res[2, 2]),
"\nSubjects per group :", res[2, 3],
"\nDegrees of freedom :",
sprintf("%3i", dfs[1]), "(pooled model 4)",
"\n ",
sprintf("%3i", dfs[2]), "(group model 5)")
}4] <- sprintf("%6.4f", res[, 4])
res[, 5] <- sprintf("%6.4f", res[, 5])
res[, 6] <- sprintf("%+.3f%%", res[, 6])
res[, names(res)[4:6] <- c("pooled model", "group model", "change")
<- paste0(txt, paste0("\n", paste(rep("—", 46), collapse = "")),
txt "\nAchieved power of the pooled and group models;",
"\nrelative change in power of the group model",
"\ncompared to the pooled model:\n\n")
if (show) cat(txt)
if (do.rate == 0) {
print(res[, c(1, 4:6)], row.names = FALSE)
else {
}row.names(res) <- c("Expected", "Adjusted")
print(res[, c(1, 4:6)])
}return(result = list(power = res, df = dfs))
}
top of section ↩︎ previous section ↩︎
Say, the assumed CV is 30%, the T/R-ratio 0.95, and we plan the study for ≥ 80% power in a 2×2×2 crossover design. The capacity of the clinical site is 24. We anticipate a dropout-rate of 5% and adjust the sample size accordingly (i.e., dose more subjects in order to have at least as many eligible subjects than estimated for the target power). We want to have at least one group with the capacity of the site.
<- 0.30
CV <- 0.95
theta0 <- 0.80
target <- "2x2x2"
design <- 24
capacity <- 0.05
do.rate <- sim.groups(CV = CV, theta0 = theta0, target = target,
x design = design, capacity = capacity, do.rate = do.rate)
# Attempting to have at least one group with
# the maximum capacity of the clinical site.
# CV : 0.3000
# theta0 : 0.9500
# BE-limits : 0.8000 – 1.2500
# Target power : 0.80
# Design : 2x2x2
# Clinical capacity : 24
# ——————————————————————————————————————————————
# Total sample size : 40
# Number of groups : 2
# Subjects per group : 24 | 16
# Degrees of freedom : 38 (pooled model 4)
# 37 (group model 5)
# ——————————————————————————————————————————————
# Anticip. dropout-rate: 5%
# Adjusted sample size : 44
# Number of groups : 2
# Subjects per group : 24 | 20
# Degrees of freedom : 42 (pooled model 4)
# 41 (group model 5)
# ——————————————————————————————————————————————
# Achieved power of the pooled and group models;
# relative change in power of the group model
# compared to the pooled model:
#
# n pooled model group model change
# Expected 40 0.8158 0.8120 -0.466%
# Adjusted 44 0.8508 0.8498 -0.116%
In the group model \(\small{(5)}\) we loose only one degree of freedom compared to the pooled model \(\small{(4)}\). Hence, the loss in power is negligble.
Let us try a four-period full replicate design. Since the periods double, we anticipate a higher dropout-rate of 10%.
<- 0.30
CV <- 0.95
theta0 <- 0.80
target <- "2x2x4"
design <- 0.1
do.rate <- 24
capacity <- sim.groups(CV = CV, theta0 = theta0, target = target,
x design = design, do.rate = do.rate, capacity = capacity)
# Error: The adjusted sample size of 24 does not exhaust the clinical capacity.
Although we anticipated a higher dropout-rate, we could readily perform the study in a single group. Would there be problems with power?
<- function(n, ns) {
make.equal # make equally sized sequences
return(as.integer(ns * (n %/% ns + as.logical(n %% ns))))
}<- function(n, do.rate, ns) {
nadj # adjust the sample size
return(as.integer(make.equal(n / (1 - do.rate), ns)))
}<- 0.30
CV <- 0.95
theta0 <- 0.80
target <- "2x2x4"
design <- 0.1
do.rate <- as.integer(substr(design, 3, 3))
ns <- sampleN.TOST(CV = CV, theta0 = theta0, targetpower = target,
x design = design, print = FALSE, details = FALSE)[7:8]
<- nadj(x[["Sample size"]], do.rate, ns):x[["Sample size"]]
elig <- data.frame(eligible = elig, power = NA_real_)
x for (i in seq_along(elig)) {
$power[i] <- suppressMessages(
xpower.TOST(CV = CV, theta0 = theta0,
n = elig[i], design = design))
}print(signif(x, 4), row.names = FALSE)
# eligible power
# 24 0.8819
# 23 0.8682
# 22 0.8543
# 21 0.8374
# 20 0.8202
Not at all. Furthermore, in the full replicate design performed in a single group we have more degrees of freedom than in the 2×2×2 crossover performed in two groups (see the table above).
<- 0.335
CV <- CV2mse(CV)
var <- c(44, 40)
n.cross <- c(24, 20)
n.repl <- data.frame(design = rep("2x2x2", 2),
cross groups = rep(2, 2),
n = n.cross,
df = n.cross - 2 - (2 - 1),
CL.lo = NA_real_, CL.hi = NA_real_)
<- data.frame(design = rep("2x2x4", 2),
repl groups = rep(1, 2),
n = n.repl,
df = 3 * n.repl - 4,
CL.lo = NA_real_, CL.hi = NA_real_)
for (i in 1:2) {
5:6] <- exp(log(theta0) - log(1) + c(-1, +1) *
cross[i, qt(1 - 0.05, df = cross$df[i]) *
sqrt(var / 2 * (1 / n.cross[1] + 1 / n.cross[2])))
5:6] <- exp(log(theta0) - log(1) + c(-1, +1) *
repl[i, qt(1 - 0.05, df = repl$df[i]) *
sqrt(var / 4 * (1 / n.repl[1] + 1 / n.repl[2])))
}$width <- cross$CL.hi - cross$CL.lo
cross$width <- repl$CL.hi - repl$CL.lo
repl<- cross$width - repl$width
diffs <- rbind(repl, cross)
x names(x)[2:3] <- c("group(s)", "N")
2] <- sprintf("%.0f ", x[, 2])
x[, 5] <- sprintf("%.3f%%", 100 * x[, 5])
x[, 6] <- sprintf("%.3f%%", 100 * x[, 6])
x[, 7] <- sprintf("%.3f%%", 100 * x[, 7])
x[, print(x, row.names = FALSE)
cat("Maximum difference in the width of CIs:",
sprintf("%.2g%%.\n",100 * max(diffs)))
# design group(s) N df CL.lo CL.hi width
# 2x2x4 1 24 68 87.492% 103.152% 15.660%
# 2x2x4 1 20 56 87.471% 103.177% 15.707%
# 2x2x2 2 44 41 87.277% 103.406% 16.128%
# 2x2x2 2 40 37 87.259% 103.428% 16.169%
# Maximum difference in the width of CIs: 0.47%.
Hence, the CIs in the replicate design will be ≈ 0.5% narrower than in the 2×2×2 crossover.
Say, we have to deal with a narrow therapeutic index drug
(BE-limits 90.00 – 111.11%) with low
variability of 7%. We assume a T/R-ratio of 0.975 and target ≥ 80% power
in a 2×2×2 crossover design. Furthermore, we anticipate a dropout-rate
of 5%. Due to potential AEs close
monitoring of subjects is required and the clinical site is equipped
with eight fully monitored beds. By specifying equal = TRUE
the script attempts to generate equally sized groups.
<- 0.07
CV <- 0.975
theta0 <- 0.90
theta1 <- 0.80
target <- "2x2x2"
design <- 0.05
do.rate <- 8
capacity <- sim.groups(CV = CV, theta0 = theta0, theta1 = theta1,
x target = target, design = design,
do.rate = do.rate, capacity = capacity,
equal = TRUE)
# Note: Imbalanced sequences in groups (7|7) corrected.
# Attempting to generate equally sized groups.
# CV : 0.0700
# theta0 : 0.9750
# BE-limits : 0.9000 – 1.1111
# Target power : 0.80
# Design : 2x2x2
# Clinical capacity : 8
# ——————————————————————————————————————————————
# Total sample size : 12
# Number of groups : 2
# Subjects per group : 6 | 6
# Degrees of freedom : 10 (pooled model 4)
# 9 (group model 5)
# ——————————————————————————————————————————————
# Anticip. dropout-rate: 5%
# Adjusted sample size : 14
# Number of groups : 2
# Subjects per group : 8 | 6
# Degrees of freedom : 12 (pooled model 4)
# 11 (group model 5)
# ——————————————————————————————————————————————
# Achieved power of the pooled and group models;
# relative change in power of the group model
# compared to the pooled model:
#
# n pooled model group model change
# Expected 12 0.8274 0.8218 -0.677%
# Adjusted 14 0.8849 0.8824 -0.283%
Two equally sized groups of six subjects would be possible with the
estimated sample size. That’s what we requested. However, for the
adjusted sample size of 14 subjects, groups of seven subjects would be
imbalanced.7 Hence, a note is issued and sequences are
corrected in such a way that both groups are balanced.
As usual, the loss in power by the group model \(\small{(5)}\) compared to the pooled
model \(\small{(4)}\) is less than one percent
and thus, negligble.
We can assess the Type I Error by setting \(\small{\theta_0}\) to one of the
BE limits \(\small{\left\{\theta_1,\theta_2\right\}}\),
i.e., assume that the null hypothesis is true. This can be done
either by simulations with the function power.TOST.sds()
or
with the function power.TOST()
after adjusting the degrees
of freedom as outlined above. We need at least
106 simulations in order to obtain a stable result.
# Cave: long runtime
<- 0.30
CV <- "2x2x2"
design <- 0.80
theta1 <- 1.25
theta2 <- c(22L, 18L)
ngrp <- 2
grps <- sum(ngrp)
n <- 3:2
gmodel <- 1e6
nsims <- c(n, n - (grps - 1))
adj <- data.frame(model = c("pooled", "group"),
x df = c(n - 2, n - 2 - (grps - 1)),
simulated = NA_real_, exact = NA_real_)
for (i in seq_along(gmodel)) {
$simulated[i] <- power.TOST.sds(CV = CV, theta0 = theta2, n = n,
xdesign = design, grps = grps,
ngrp = ngrp, gmodel = gmodel[i],
nsims = nsims, progress = FALSE)
$exact[i] <- suppressMessages(
xpower.TOST(CV = CV, theta0 = theta2,
n = adj[i], design = design))
}cat("CV :", sprintf("%.4f", CV),
"\nBE-limits :",
sprintf("%.4f – %.4f", theta1, theta2),
"\nDesign :", design,
"\nTotal sample size :", n,
"\nNumber of groups :", sprintf("%2.0f", grps),
"\nSubjects per group:", paste(ngrp, collapse = " | "),
"\nNull assessed at :", sprintf("%.4f\n\n", theta2))
print(x, row.names = FALSE, right = FALSE)
# CV : 0.3000
# BE-limits : 0.8000 – 1.2500
# Design : 2x2x2
# Total sample size : 40
# Number of groups : 2
# Subjects per group: 22 | 18
# Null assessed at : 1.2500
#
# model df simulated exact
# pooled 38 0.049801 0.04999975
# group 37 0.049791 0.04999961
Of course, in the pooled model \(\small{(4)}\) the Type I Error is controlled. Due to less degrees of freedom in the group model \(\small{(5)}\) the Type I Error is even lower.
Coming back to the question asked in the Introduction. To repeat:
What is the best approach to deal with the limited capacity of a clinical site?
A member of the EMA’s PKWP once told me that he would like to see all studies performed in a replicate design – regardless whether the drug / drug product is highly variable or not. From a purely statistical perspective, he is right and it is one of the rare cases where we were of the same opinion.
Power (and hence, the sample size) depends on the number of
treatments – the smaller sample size in replicate designs is compensated
by more administrations. If the total sample size of a 2×2×2 crossover
design is \(\small{N}\), then the
sample size of a four-period replicate design is \(\small{\tfrac{1}{2}N}\) and of a
three-period replicate design \(\small{\tfrac{3}{4}N}\). Costs of a
replicate design are similar to the common 2×2×2 crossover design (for
details see another
article).
Nevertheless, smaller sample sizes may come with a price. We will save
costs due to less pre-/post-study exams but have to pay a higher subject
remuneration (more hospitalizations and blood draws). However, we have
the same number of samples to analyze and study costs are driven to a
good part by bioanalytics.9 Furthermore, one must be aware that more
periods / washout phases increase the chance of dropouts.
Testing for a significant group- (or site-) by-treatment interaction should be avoided.
In a nutshell: Model \(\small{(5)}\) is expanded with an interaction term (\(\small{F\times treatment}\)) to \[\eqalign{response&|\;F,\,\;sequence,\,subject(F\times sequence),\\ &\phantom{|}\;period(F),F\times sequence,\,treatment,\,F\times treatment}\tag{9}\]
Regrettably the WHO stated also:16
“In those cases where the subjects are recruited and treated in groups, it is appropriate to investigate the statistical significance of the group-by-formulation interaction e.g., with the following ANOVA model: Group, Sequence, Formulation, Period (nested within Group), Group-by-Sequence interaction, Subject (nested within Group*Sequence) and Group-by-Formulation interaction. If this interaction is significant, the study results are not interpretable. However, it is not considered to be correct to report the results of the 90% confidence interval of the ratio test/comparator based on the standard error derived from this ANOVA. If the group-by-formulation interaction is not significant, the 90% confidence interval should be calculated based on the ANOVA model defined in the protocol. This model may or may not include the group effect as pre-defined in the protocol. This depends on whether the group effect is believed to explain the variability observed in the data.
On the contrary: It is not appropriate!
At least we can state in the protocol that we don’t believe
[sic]17
in a group-by-treatment interaction. Believes don’t belong to
the realm of science – only assumptions do. The problems
arising with \(\small{(9)}\) in terms of power and
inflation of the Type I Error is elaborated in another article.
When I gave a presentation18 about groups/sites in bioequivalence in
front of European regulatory (‼) statisticians, they were asking in
disbelieve »What the heck are you talking about?«
The European Guideline19 states:
“The study should be designed in such a way that the formulation effect can be distinguished from other effects.
The precise model to be used for the analysis should be pre-specified in the protocol. The statistical analysis should take into account sources of variation that can be reasonably assumed to have an effect on the response variable.
Is it reasonable to assume that groups have an effect on the PK response (i.e., a true group-by-treatment interaction)? IMHO, not at all. Recall that we have the same inclusion- & exclusion-criteria and hence, similar demographics, the same clinical site, quite often groups separated by only a couple of days, and analyze samples with the same bioanalytical method. In a crossover study each subject acts as its own control, right?
If we follow the approach outlined above, we will face a false positive rate at the level of the test \(\small{\alpha}\) and will not be allowed to use \(\small{(4)}\) or \(\small{(5)}\) of pooled data in ~10% of cases. If that happens, the loss in power will be tremendous. Returning to the example from above:
<- "2x2x2"
design <- 0.95
theta0 <- 0.30
CV <- 2
grps <- c(24, 16)
ngrp <- data.frame(model = c("pooled", "groups", "large group"),
res n = c(rep(sum(ngrp), 2), ngrp[1]),
grps = c(rep(grps, 2), 1),
n.grp = c(rep(paste(ngrp, collapse = "|"), 2), ngrp[1]),
power = NA_real_, loss = NA_real_)
$power[1] <- power.TOST(CV = CV, theta0 = theta0,
resn = res$n[1], design = design)
$power[2] <- power.TOST.sds(CV = CV, theta0 = theta0, n = res$n[2],
resdesign = design, grps = grps, ngrp = ngrp,
gmodel = 2, progress = FALSE)
$power[3] <- power.TOST(CV = CV, theta0 = theta0,
resn = ngrp[1], design = design)
$loss[2:3] <- sprintf("%4.1f%%",
res100 * abs((res$power[2:3] - res$power[1]) /
$power[1]))
res$loss[is.na(res$loss)] <- "- "
res3] <- sprintf("%.0f ", res[, 3])
res[, 1] <- c("pooled (4)", "groups (5)", "large group only (4)")
res[$power <- signif(res$power, 4)
resnames(res)[c(2:4)] <- c("N", "group(s)", "n/group")
print(res, row.names = FALSE)
# model N group(s) n/group power loss
# pooled (4) 40 2 24|16 0.8158 -
# groups (5) 40 2 24|16 0.8120 0.5%
# large group only (4) 24 1 24 0.5577 31.6%
Given, that would happen only in ~10% of studies but if it does,
power will be hardly larger than tossing a coin.
A meta-study20 showed that the
overall loss in power in an evaluation by \(\small{(5)}\)
instead of by \(\small{(4)}\) can reach ~7%.
Hundreds (or thousands‽) of studies have been performed in multiple groups, evaluated by \(\small{(4)}\) and were accepted by European agencies in the twinkling of an eye.
Helmut Schütz 2024
R
and PowerTOST
GPL 3.0,
pandoc
GPL 2.0.
1st version June 28, 2022. Rendered July 2, 2024 10:59 CEST
by rmarkdown
via pandoc in 0.28 seconds.
Labes D, Schütz H, Lang B. PowerTOST: Power and Sample Size for (Bio)Equivalence Studies. Package version 1.5.6. 2024-03-18. CRAN.↩︎
FDA, CDER. Guidance for Industry. Statistical Approaches to Establishing Bioequivalence. Rockville. January 2001. Download.↩︎
FDA, CDER. Draft Guidance for Industry. Statistical Approaches to Establishing Bioequivalence. Silver Spring. December 2022. Revision 1. Download.↩︎
ICH. Bioequivalence for Immediate Release Solid Oral Dosage Forms. M13A. Draft version. 20 December 2022. Online.↩︎
Statisticians may be more familiar with the terminology
\(\small{\textrm{AB}|\textrm{BA}}\).
Forgive me – in bioequivalence \(\small{\textrm{TR}|\textrm{RT}}\) is
commonly used, where \(\small{\textrm{T}}\) denotes the Test and
\(\small{\textrm{R}}\) the Reference
treatment (formulation).↩︎
If Subject 1 is randomized to sequence \(\small{\text{TR}}\), there is not ‘another’ Subject 1 randomized to sequence \(\small{\text{RT}}\). Randomization is not like Schrödinger’s cat. Hence, the nested term in the guidelines is an insult to the mind.↩︎
Not the same number of subjects in sequence \(\small{\textrm{TR}}\) than in sequence \(\small{\textrm{RT}}\).↩︎
Of course, not in the \(\small{\textrm{TTRR|RRTR}}\) full replicate design. However, it is rarely employed in practice.↩︎
In case of a ‘poor’ bioanalytical method requiring a large sample volume: Since the total blood sampling volume is generally limited with the one of a blood donation, one may opt for a three-period full replicate or has to measure HCT prior to administration in higher periods and – for safety reasons – exclude subjects if their HCT is too high.↩︎
Since this is a between-subject comparison, it has low power. IMHO, the 10% level is arbitrary (I failed to find any justification).↩︎
What if more than one group have the same size? Has BE to be demonstrated for all of them? Quite possible, since that would be the most conservative approach. However, I failed to find a reference. That’s one of the reasons I prefer to have at least one group in the maximum capacity of the site instead of equally sized groups.↩︎
If you have – really! – nothing better to do (long
runtime): Try the examples with the additional argument
gmodel = 1
in the function. In the first example 5.3–5.7%
power will be lost and in the second 3.6–6.0%.↩︎
Schütz H. Multi-Group Studies in Bioequivalence. To pool or not to pool? Presentation at: BioBridges 2018. Prague. 26–27 September 2018. Online.↩︎
For the examples I got in 106 simulations (extremely long runtime) an inflated Type I Error of ≈ 0.063.↩︎
In a meeting in April 2021 the FDA accepted my concerns about an inflated Tpye I Error and agreed that in a planned multi-center study model \(\small{(5)}\) without a pre-test should be used.↩︎
WHO. Frequent deficiencies in BE study protocols. Geneva. November 2020. Online.↩︎
Quoting my late father: »If you believe, go to church.«↩︎
Schütz H. Multi-Group and Multi-Site Studies. To pool or not to pool? 2nd Annual Biosimilars Forum. Satellite Short Course. Budapest. 5 October 2017. Online.↩︎
EMA, CHMP. Guideline on the Investigation of Bioequivalence. London. 20 January 2010. Online.↩︎
Schütz H, Burger DA, Cobo E, Dubins D, Farkas T, Labes D, Lang B, Ocaña J, Ring A, Shitova A, Stus V, Tomashevskiy M. Group-by-Treatment Interaction Effects in Comparative Bioavailability Studies. AAPS J. 2024; 26(3): 50. doi:10.1208/s12248-024-00921-x. Open Access. Supplementary Material.↩︎