Consider allowing JavaScript. Otherwise, you have to be proficient in reading since formulas will not be rendered. Furthermore, the table of contents in the left column for navigation will not be available and code-folding not supported. Sorry for the inconvenience.

Examples in this article were generated with
4.4.1
by the package `PowerTOST`

.^{1}

- The right-hand badges give the ‘level’ of the respective section.

- Basics about sample size methodology and study designs – requiring no or only limited statistical expertise.

- These sections are the most important ones. They are – hopefully – easily comprehensible even for novices.

- A somewhat higher knowledge of statistics and/or R is required. May be skipped or reserved for a later reading.

- An advanced knowledge of statistics and/or R is required. Not recommended for beginners in particular.

- Click to show / hide R code.
- Click on the icon in the top left corner to copy R code to the clipboard.

Abbreviation | Meaning |
---|---|

\(\small{\alpha}\) | Level of the test |

BE | Bioequivalence |

CI | Confidence Interval |

\(\small{CV}\) |
Within-subject Coefficient of Variation (paired, crossover, replicate
designs), total CV (parallel design) |

\(\small{\Delta}\) | Clinically relevant difference |

\(\small{H_0}\) | Null hypothesis |

\(\small{H_1}\) | Alternative hypothesis (also \(\small{H_\textrm{a}}\)) |

\(\small{\theta_0=\mu_\textrm{T}/\mu_\textrm{R}}\) | True T/R-ratio |

\(\small{\theta_1,\;\theta_2}\) | Lower and upper limits of the acceptance range (BE margins) |

What is the best approach to deal with the limited capacity of a clinical site?

The most simple – and *preferable* – approach is to find a
clinical site which is able to accommodate all subjects at once. If this
is not possible, subjects could be allocated to multiple
groups (aka cohorts) or sites.
Whether or not a group- (site-) term should by included in the
statistical model is still the topic of heated debates. In the case of
multi-site studies regulators require a modification of the model.^{2} ^{3} ^{4}

Since in replicate designs less subjects are required to achieve the same power than in conventional 2×2×2 crossover design, sometimes multi-group or multi-site studies can be avoided.

A *basic* knowledge of R is
required. To run the scripts at least version 1.4.8 (2019-08-29) of
`PowerTOST`

is suggested. Any version of R would likely do, though the current release of
`PowerTOST`

was only tested with version 4.3.3 (2024-02-29)
and later.

All scripts were run on a Xeon E3-1245v3 @ 3.40GHz (1/4 cores) 16GB RAM
with R 4.4.1 on Windows 7 build 7601, Service
Pack 1, Universal C Runtime 10.0.10240.16390.

The examples deal primarily with the 2×2×2 crossover design (\(\small{\textrm{TR}|\textrm{RT}}\))^{5} but the
concept is applicable to any kind of crossover (Higher-Order, replicate
designs) or parallel designs assessed for equivalence.

In the following I consider *groups*, *i.e.*, studies
performed in a single site. However, the concept is applicable for
studies performed in multiple *sites* as well. Studies in groups
in multiple sites are out of scope of this article.

The lower and upper limits of the bioequivalence (BE) range \(\small{\{\theta_1,\theta_2\}}\) are
calculated based on the *‘clinically not relevant difference’*
\(\small{\Delta}\) \[\left\{\theta_1=100\,(1-\Delta),\,\theta_2=100\,(1-\Delta)^{-1}\right\}\tag{1}\]
Commonly \(\small{\Delta}\) is set to
0.20. For narrow therapeutic index drugs (the
EMA and other
jurisdictions) \(\small{\Delta}\) is
set to 0.10, and for \(\small{C_\textrm{max}}\) (Russian
Federation, the EEU, South
Africa) \(\small{\Delta}\) is set to
0.25. We obtain: \[\begin{matrix}\tag{2}
\left\{\theta_1=80.00\%,\,\theta_2=125.00\%\right\}\\
\left\{\theta_1=90.00\%,\,\theta_2=111.1\dot{1}\%\right\}\\
\left\{\theta_1=75.00\%,\,\theta_2=133.3\dot{3}\%\right\}
\end{matrix}\]

Conventionally BE is assessed by the confidence interval inclusion approach: \[H_0:\frac{\mu_\textrm{T}}{\mu_\textrm{R}}\not\subset\left\{\theta_1,\theta_2\right\}\:vs\:H_1:\theta_1<\frac{\mu_\textrm{T}}{\mu_\textrm{R}}<\theta_2\tag{3}\]

As long as the \(\small{100\,(1-2\,\alpha)}\) confidence
interval lies entirely within pre-specified limits \(\small{\{\theta_1,\theta_2\}}\), the null hypothesis of
__in__equivalence in \(\small{(3)}\) is rejected and the
alternative hypothesis of equivalence in \(\small{(3)}\) is accepted.

The model for a crossover design in average bioequivalence as stated
in guidelines is \[\eqalign{response&|\;sequence,\,subject(sequence),\\
&\phantom{|}\;period,\,treatment}\tag{4}\] Most
regulations recommend an
ANOVA, *i.e.*, all
effects fixed. The
FDA and Health
Canada recommend a mixed-effects model, *i.e.*, to specify \(\small{subject(sequence)}\) as a random
effect and all others fixed.

It must be mentioned that in comparative bioavailability studies
subjects are usually uniquely coded. Hence, the nested term \(\small{subject(sequence)}\) in \(\small{(4)}\) is a
bogus one^{6} and could be replaced by the simple \(\small{subject}\) as well. See also another
article.

If the study is performed in multiple groups or sites, model \(\small{(4)}\) can be modified to \[\eqalign{response&|\;F,\,\;sequence,\,subject(F\times sequence),\\ &\phantom{|}\;period(F),F\times sequence,\,treatment\small{\textsf{,}}}\tag{5}\] where \(\small{F}\) is the code for \(\small{group}\) or \(\small{site}\), respectively.

When \(\small{N_\text{F}}\) is the
number of groups or sites, in \(\small{(5)}\) there are \(\small{N_\text{F}-1}\) less residual
degrees of freedom than in \(\small{(4)}\). The table below gives
the designs implemented in `PowerTOST`

.

\[\small{\begin{array}{l|lccccrr}
\phantom{00000000}\textsf{design} & \phantom{00}\textsf{code} &
\textsf{t} & \textsf{s} & \textsf{p} & \textsf{m} &
(4)\phantom{0} & (5)\phantom{00000000}\\\hline
\textsf{Parallel} & \textsf{parallel} & 2 & - & 1 &
1 & N-2 & N-2-(N_\text{F}-1)\\
\textsf{Paired means} & \textsf{paired} & 2 & - & 2
& 2 & N-1 & N-1-(N_\text{F}-1)\\
\textsf{Crossover} & \textsf{2}\times\textsf{2}\times\textsf{2}
& 2 & 2 & 2 & \tfrac{1}{2} & N-2 &
N-2-(N_\text{F}-1)\\
\textsf{2-sequence 3-period full replicate} &
\textsf{2}\times\textsf{2}\times\textsf{3} & 2 & 2 & 3 &
\tfrac{3}{8} & 2N-3 & 2N-3-(N_\text{F}-1)\\
\textsf{2-sequence 4-period full replicate} &
\textsf{2}\times\textsf{2}\times\textsf{4} & 2 & 2 & 4 &
\tfrac{1}{4} & 3N-4 & 3N-4-(N_\text{F}-1)\\
\textsf{4-sequence 4-period full replicate} &
\textsf{2}\times\textsf{4}\times\textsf{4} & 2 & 4 & 4 &
\tfrac{1}{16} & 3N-4 & 3N-4-(N_\text{F}-1)\\
\textsf{Partial replicate} &
\textsf{2}\times\textsf{3}\times\textsf{3} & 2 & 3 & 3 &
\tfrac{1}{6} & 2N-2 & 2N-3-(N_\text{F}-1)\\
\textsf{Balaam}'\textsf{s} &
\textsf{2}\times\textsf{4}\times\textsf{2} & 2 & 4 & 2 &
\tfrac{1}{2} & N-2 & N-2-(N_\text{F}-1)\\\textsf{Latin Square or
Williams}' & \textsf{3}\times\textsf{3} & 3 & 3 & 3
& \tfrac{2}{9} & 2N-4 & 2N-4-(N_\text{F}-1)\\
\textsf{Williams}' & \textsf{3}\times\textsf{6}\times\textsf{3}
& 3 & 6 & 3 & \tfrac{1}{18} & 2N-4 &
2N-4-(N_\text{F}-1)\\
\textsf{Latin Square or Williams}' & \textsf{4}\times\textsf{4}
& 4 & 4 & 4 & \tfrac{1}{8} & 3N-6 &
3N-6-(N_\text{F}-1)\\
\end{array}}\] \(\small{\textsf{code}}\) is the
`design`

-argument in the functions of `PowerTOST`

.
\(\small{\textsf{t, s, p}}\) are the
number of treatments, sequences, and periods, respectively. \(\small{\textsf{m}}\) is the multiplier in
the radix of \(\small{(6)}\) and \(\small{N}\) is the total number of
subjects, *i.e.*, \(\small{N=\sum_{i=1}^{i=s}n_i}\).

The back-transformed \(\small{1-2\,\alpha}\) Confidence Interval (CI) is calculated by \[CI=100\,\exp\left(\log_{e}\overline{x}_\text{T}-\log_{e}\overline{x}_\text{R}\mp t_{df,\alpha}\sqrt{m \times\widehat{s^2}\sum_{i=1}^{i=s}\frac{1}{n_i}}\right)\small{\textsf{,}}\tag{6}\] where \(\widehat{s^2}\) is the residual variance of the model and \(\small{n_i}\) is the number of subjects in the \(\small{i^\text{th}}\) of \(\small{s}\) sequences.

`library(PowerTOST) # attach it to run the examples`

In a 2×2×2 crossover design the residual
degrees of freedom in the models are \[\eqalign{(4):df&=N-2\\
(5):df&=N-2-(N_\text{F}-1)\small{\textsf{,}}}\tag{7}\]
where \(\small{N}\) is total number of
subjects. Therefore, the CI by
\(\small{(6)}\)
for given \(\small{N}\) and \(\small{\widehat{s^2}}\) by \(\small{(5)}\) will
always be wider (more conservative) than the one by \(\small{(4)}\) due
to the fewer degrees of freedom and hence, larger \(\small{t}\)-value.

However, unless the sample size is not small and the number of groups
(or sites) large, the difference in widths of the
CIs is generally
neglegible.

```
<- "2x2x2"
design <- 0.95
theta0 <- 0.335
CV <- CV2mse(CV)
var <- sampleN.TOST(CV = CV, theta0 = theta0, design = design,
n print = FALSE)[["Sample size"]]
<- n2 <- n / 2
n1 <- c(2:4, 6, 8)
NF <- data.frame(n = n, NF = NF,
x df.4 = n - 2,
CL.lo.4 = NA_real_, CL.hi.4 = NA_real_,
width.4 = NA_real_,
df.5 = n - 2 - (NF - 1),
CL.lo.5 = NA_real_, CL.hi.5 = NA_real_,
width.5 = NA_real_)
for (i in seq_along(NF)) {
4:5] <- exp(log(theta0) - log(1) + c(-1, +1) *
x[i, qt(1 - 0.05, df = x$df.4[i]) *
sqrt(var / 2 * (1 / n1 + 1 / n2)))
8:9] <- exp(log(theta0) - log(1) + c(-1, +1) *
x[i, qt(1 - 0.05, df = x$df.5[i]) *
sqrt(var / 2 * (1 / n1 + 1 / n2)))
}$width.4 <- x$CL.hi.4 - x$CL.lo.4
x$width.5 <- x$CL.hi.5 - x$CL.lo.5
x<- x$width.5 - x$width.4
diffs names(x)[c(1, 3:5, 7:9)] <- c("N", "df (4)", "CL.lo (4)", "CL.hi (4)",
"df (5)", "CL.lo (5)", "CL.hi (5)")
4] <- sprintf("%.3f%%", 100 * x[, 4])
x[, 5] <- sprintf("%.3f%%", 100 * x[, 5])
x[, 8] <- sprintf("%.3f%%", 100 * x[, 8])
x[, 9] <- sprintf("%.3f%%", 100 * x[, 9])
x[, print(x[, c(1:5, 7:9)], row.names = FALSE)
cat("Maximum difference in the width of CIs:",
sprintf("%.2g%%.\n", 100 * max(diffs)))
```

```
# N NF df (4) CL.lo (4) CL.hi (4) df (5) CL.lo (5) CL.hi (5)
# 48 2 46 84.955% 106.232% 45 84.951% 106.238%
# 48 3 46 84.955% 106.232% 44 84.946% 106.243%
# 48 4 46 84.955% 106.232% 43 84.942% 106.249%
# 48 6 46 84.955% 106.232% 41 84.932% 106.262%
# 48 8 46 84.955% 106.232% 39 84.920% 106.276%
# Maximum difference in the width of CIs: 0.079%.
```

If we have a study with 48 subjects in a 2×2×2 crossover design performed in two to eight groups (or sites), the difference in the widths of CIs is negligible in any case.

In a two-sequence four-period full replicate design the residual degrees of freedom in the models are \[\eqalign{(4):df&=3N-4\\ (5):df&=3N-4-(N_\text{F}-1)\small{\textsf{,}}}\tag{8}\]where \(\small{N}\) is total number of subjects. We need only half of the subjects because power depends on the number of treatments, which is the same like in the 2×2×2 crossover. Again, the CI by \(\small{(6)}\) for given \(\small{N}\) and \(\small{\widehat{s^2}}\) by \(\small{(5)}\) will always be wider than the one by \(\small{(4)}\).

```
<- "2x2x4"
design <- 0.95
theta0 <- 0.335
CV <- CV2mse(CV)
var <- sampleN.TOST(CV = CV, theta0 = theta0, design = design,
n print = FALSE)[["Sample size"]]
<- n2 <- n / 2
n1 <- c(2:4, 6)
NF <- data.frame(n = n, NF = NF,
x df.4 = 3 * n - 4,
CL.lo.4 = NA_real_, CL.hi.4 = NA_real_,
width.4 = NA_real_,
df.5 = 3 * n - 4 - (NF - 1),
CL.lo.5 = NA_real_, CL.hi.5 = NA_real_,
width.5 = NA_real_)
for (i in seq_along(NF)) {
4:5] <- exp(log(theta0) - log(1) + c(-1, +1) *
x[i, qt(1 - 0.05, df = x$df.4[i]) *
sqrt(var / 4 * (1 / n1 + 1 / n2)))
8:9] <- exp(log(theta0) - log(1) + c(-1, +1) *
x[i, qt(1 - 0.05, df = x$df.5[i]) *
sqrt(var / 4 * (1 / n1 + 1 / n2)))
}$width.4 <- x$CL.hi.4 - x$CL.lo.4
x$width.5 <- x$CL.hi.5 - x$CL.lo.5
x<- x$width.5 - x$width.4
diffs names(x)[c(1, 3:5, 7:9)] <- c("N", "df (4)", "CL.lo (4)", "CL.hi (4)",
"df (5)", "CL.lo (5)", "CL.hi (5)")
4] <- sprintf("%.3f%%", 100 * x[, 4])
x[, 5] <- sprintf("%.3f%%", 100 * x[, 5])
x[, 8] <- sprintf("%.3f%%", 100 * x[, 8])
x[, 9] <- sprintf("%.3f%%", 100 * x[, 9])
x[, print(x[, c(1:5, 7:9)], row.names = FALSE)
cat("Maximum difference in the width of CIs:",
sprintf("%.2g%%.\n",100 * max(diffs)))
```

```
# N NF df (4) CL.lo (4) CL.hi (4) df (5) CL.lo (5) CL.hi (5)
# 24 2 68 85.018% 106.154% 67 85.016% 106.156%
# 24 3 68 85.018% 106.154% 66 85.014% 106.159%
# 24 4 68 85.018% 106.154% 65 85.012% 106.161%
# 24 6 68 85.018% 106.154% 63 85.008% 106.167%
# Maximum difference in the width of CIs: 0.023%.
```

Here the difference in the widths of CIs is even smaller than in the 2×2×2 crossover due to the larger degrees of freedom.

The function `power.TOST.sds()`

of `PowerTOST`

supports simulations of \(\small{(5)}\) of studies in a 2×2×2
crossover, as well as in full and partial replicate designs.

The script supports sample size estimation for Average Bioequivalence
and generation of groups based on the clinical capacity, where sequences
within groups will be balanced. The estimated sample size can be
adjusted based on an anticipated dropout-rate. If this is selected
(`do.rate > 0`

) two simulations are performed: One for the
adjusted sample size (optimistic: no dropouts) and one for the estimated
sample size (pessimistic: dropout-rate realized).

Two options for the generation of groups are supported by the logical
argument `equal`

:

`FALSE`

: Attempts to generate*at least*one group with the maximum size of the clinical site (default).`TRUE`

: Attempts to generate equally sized groups.

Last but not least power by \(\small{(5)}\) is compared to exact
power by \(\small{(4)}\).

As we have seen in the table above, due to its
lower degrees of freedom power of model \(\small{(5)}\) should always be lower
than the one of model \(\small{(4)}\). If you get a positive
`change`

value in the comparison, it is a simulation
artifact. In such a case, increase the number of simulations
(`nsims = 1e6`

or higher). Cave: 168
LOC.

```
<- function(CV, theta0 = 0.95, theta1, theta2, target = 0.80,
sim.groups design = "2x2x2", capacity, equal = FALSE,
gmodel = 2, do.rate = 0, nsims = 1e5L, show = TRUE) {
##########################################################
# Explore the impact on power of a group model com- #
# pared to the conventional model of pooled data via #
# simulations. #
# ------------------------------------------------------ #
# capacity: maximum capacity of the clinical site #
# equal: TRUE : tries to generate equally sized #
# groups #
# FALSE: tries to get at least one group with #
# the maximum size of the clinical site #
# (default) #
# do.rate: anticipated dropout-rate; if > 0 a second #
# simulation is performed based on the adjust- #
# ed sample size #
##########################################################
if (missing(theta1) & missing(theta2)) theta1 <- 0.80
if (missing(theta1)) theta1 <- 1 / theta2
if (missing(theta2)) theta2 <- 1 / theta1
if (theta0 < theta1 | theta0 > theta2)
stop("theta0 must be within {theta1, theta2}.", call. = FALSE)
if (theta0 == theta1 | theta0 == theta2)
stop("Simulation of the Type I Error not supported.",
"\n Use power.TOST.sds() directly.", call. = FALSE)
if (!design %in% c("2x2", "2x2x2", "2x3x3", "2x2x4", "2x2x3"))
stop("Design \"", design, "\" not implemented.", call. = FALSE)
<- as.integer(substr(design, 3, 3))
ns <- function(n, ns) {
make.equal # make equally sized sequences
return(as.integer(ns * (n %/% ns + as.logical(n %% ns))))
}<- function(n, do.rate, ns) {
nadj # adjust the sample size
return(as.integer(make.equal(n / (1 - do.rate), ns)))
}<- function(capacity, n, design, equal, do.rate) {
grouping # based on the sample size and capacity, calculate the
# number of groups and subjects / group
if (do.rate == 0) {
<- n
x <- paste("The sample size of", x)
stop.txt else {
}<- nadj(n, do.rate, ns)
x <- paste("The adjusted sample size of", x)
stop.txt
}if (x <= capacity) {
stop(stop.txt, " does not exhaust the clinical capacity.",
call. = FALSE)
else {
}# split sample size into >=2 groups based on capacity
# TODO: Handle a case where with no dropouts <= capacity (no grouping)
# but with the adjusted sample size two groups are required
if (equal) {# attempt to make all groups equally sized
<- ceiling(n / capacity)
grps <- rep(n / grps, grps)
tmp <- make.equal(tmp, ns)
ngrp if (!isTRUE(all.equal(tmp, ngrp)))
message("Note: Imbalanced sequences in groups (",
paste(round(tmp, 0), collapse = "|"), ") corrected.\n")
if (sum(ngrp) > n) ngrp[length(ngrp)] <- n-sum(ngrp[-1])
else { # at least one group = capacity
}<- rep(0, ceiling(n / capacity))
ngrp <- length(ngrp)
grps 1] <- capacity
ngrp[for (j in 2:grps) {
<- sum(ngrp) # what we have so far
n.tot if (n.tot + capacity <= n) {
<- capacity
ngrp[j] else {
}<- n - n.tot
ngrp[j]
}
}
}
}return(ngrp = list(grps = length(ngrp), ngrp = ngrp))
}if (equal) {
<- "Attempting to generate equally sized groups."
txt else {
}<- paste("Attempting to have at least one group with",
txt "\nthe maximum capacity of the clinical site.")
}<- paste(txt, "\nCV :", sprintf("%.4f", CV),
txt "\ntheta0 :", sprintf("%.4f", theta0),
"\nBE-limits :",
sprintf("%.4f – %.4f", theta1, theta2),
"\nTarget power :", sprintf("%.2f", target),
"\nDesign :", design,
"\nClinical capacity :", capacity)
<- sampleN.TOST(CV = CV, theta0 = theta0, theta1 = theta1,
tmp theta2 = theta2, targetpower = target,
design = design, print = FALSE)
<- data.frame(n = NA_integer_, grps = NA_integer_,
res n.grp = NA_integer_, m.1 = NA_real_, m.2 = NA_real_,
change = NA_real_)
$n <- tmp[["Sample size"]]
res4] <- tmp[["Achieved power"]]
res[<- grouping(capacity, res$n, design, equal, do.rate)
x 2] <- x[["grps"]]
res[<- x[["ngrp"]]
ngrp 3] <- paste(ngrp, collapse = " | ")
res[5] <- power.TOST.sds(CV = CV, theta0 = theta0, theta1 = theta1,
res[theta2 = theta2, n = res$n,
design = design, grps = res$grps, ngrp = ngrp,
gmodel = gmodel, progress = FALSE, nsims = nsims)
6] <- 100 * (res$m.2 - res$m.1) / res$m.1
res[<- known.designs()[2:3]
des <- as.expression(des$df[des$design == design])
df <- rep(NA_integer_, 2)
dfs <- res$n
n 1] <- as.integer(eval(parse(text = df)))
dfs[2] <- as.integer(dfs[1] - (res$grps - 1))
dfs[<- paste(txt, paste0("\n", paste(rep("—", 46), collapse = "")),
txt "\nTotal sample size :", res$n,
"\nNumber of groups :", sprintf("%2.0f", res[2]),
"\nSubjects per group :", res[3],
"\nDegrees of freedom :",
sprintf("%3i", dfs[1]), "(pooled model 4)",
"\n ",
sprintf("%3i", dfs[2]), "(group model 5)")
if (do.rate > 0) {
2, 1] <- nadj(n, do.rate, ns)
res[2, 4] <- signif(power.TOST(CV = CV, theta0 = theta0, theta1 = theta1,
res[theta2 = theta2, n = res$n[2],
design = design), 4)
<- grouping(capacity, res$n[2], design, equal, do.rate)
x 2, 2] <- x[["grps"]]
res[<- x[["ngrp"]]
ngrp 2, 3] <- paste(ngrp, collapse = " | ")
res[2, 5] <- power.TOST.sds(CV = CV, theta0 = theta0, theta1 = theta1,
res[theta2 = theta2, n = res$n[2],
design = design, grps = res$grps, ngrp = ngrp,
gmodel = gmodel, progress = FALSE, nsims = nsims)
2, 6] <- 100 * (res$m.2[2] - res$m.1[2]) / res$m.1[2]
res[<- rep(NA_integer_, 2)
dfs <- res$n[2]
n 1] <- as.integer(eval(parse(text = df)))
dfs[2] <- as.integer(dfs[1] - (res$grps[2] - 1))
dfs[<- paste(txt, paste0("\n", paste(rep("—", 46), collapse = "")),
txt "\nAnticip. dropout-rate:",
sprintf("%2g%%", 100 * do.rate),
"\nAdjusted sample size :", res$n[2],
"\nNumber of groups :", sprintf("%2.0f", res[2, 2]),
"\nSubjects per group :", res[2, 3],
"\nDegrees of freedom :",
sprintf("%3i", dfs[1]), "(pooled model 4)",
"\n ",
sprintf("%3i", dfs[2]), "(group model 5)")
}4] <- sprintf("%6.4f", res[, 4])
res[, 5] <- sprintf("%6.4f", res[, 5])
res[, 6] <- sprintf("%+.3f%%", res[, 6])
res[, names(res)[4:6] <- c("pooled model", "group model", "change")
<- paste0(txt, paste0("\n", paste(rep("—", 46), collapse = "")),
txt "\nAchieved power of the pooled and group models;",
"\nrelative change in power of the group model",
"\ncompared to the pooled model:\n\n")
if (show) cat(txt)
if (do.rate == 0) {
print(res[, c(1, 4:6)], row.names = FALSE)
else {
}row.names(res) <- c("Expected", "Adjusted")
print(res[, c(1, 4:6)])
}return(result = list(power = res, df = dfs))
}
```

top of section ↩︎ previous section ↩︎

Say, the assumed *CV* is 30%, the T/R-ratio 0.95, and we plan
the study for ≥ 80% power in a 2×2×2 crossover design. The capacity of
the clinical site is 24. We anticipate a dropout-rate of 5% and adjust
the sample size accordingly (*i.e.*, dose more subjects in order
to have at least as many eligible subjects than estimated for the target
power). We want to have at least one group with the capacity of the
site.

```
<- 0.30
CV <- 0.95
theta0 <- 0.80
target <- "2x2x2"
design <- 24
capacity <- 0.05
do.rate <- sim.groups(CV = CV, theta0 = theta0, target = target,
x design = design, capacity = capacity, do.rate = do.rate)
```

```
# Attempting to have at least one group with
# the maximum capacity of the clinical site.
# CV : 0.3000
# theta0 : 0.9500
# BE-limits : 0.8000 – 1.2500
# Target power : 0.80
# Design : 2x2x2
# Clinical capacity : 24
# ——————————————————————————————————————————————
# Total sample size : 40
# Number of groups : 2
# Subjects per group : 24 | 16
# Degrees of freedom : 38 (pooled model 4)
# 37 (group model 5)
# ——————————————————————————————————————————————
# Anticip. dropout-rate: 5%
# Adjusted sample size : 44
# Number of groups : 2
# Subjects per group : 24 | 20
# Degrees of freedom : 42 (pooled model 4)
# 41 (group model 5)
# ——————————————————————————————————————————————
# Achieved power of the pooled and group models;
# relative change in power of the group model
# compared to the pooled model:
#
# n pooled model group model change
# Expected 40 0.8158 0.8120 -0.466%
# Adjusted 44 0.8508 0.8498 -0.116%
```

In the group model \(\small{(5)}\) we loose only one degree of freedom compared to the pooled model \(\small{(4)}\). Hence, the loss in power is negligble.

Let us try a four-period full replicate design. Since the periods double, we anticipate a higher dropout-rate of 10%.

```
<- 0.30
CV <- 0.95
theta0 <- 0.80
target <- "2x2x4"
design <- 0.1
do.rate <- 24
capacity <- sim.groups(CV = CV, theta0 = theta0, target = target,
x design = design, do.rate = do.rate, capacity = capacity)
```

`# Error: The adjusted sample size of 24 does not exhaust the clinical capacity.`

Although we anticipated a higher dropout-rate, we could readily perform the study in a single group. Would there be problems with power?

```
<- function(n, ns) {
make.equal # make equally sized sequences
return(as.integer(ns * (n %/% ns + as.logical(n %% ns))))
}<- function(n, do.rate, ns) {
nadj # adjust the sample size
return(as.integer(make.equal(n / (1 - do.rate), ns)))
}<- 0.30
CV <- 0.95
theta0 <- 0.80
target <- "2x2x4"
design <- 0.1
do.rate <- as.integer(substr(design, 3, 3))
ns <- sampleN.TOST(CV = CV, theta0 = theta0, targetpower = target,
x design = design, print = FALSE, details = FALSE)[7:8]
<- nadj(x[["Sample size"]], do.rate, ns):x[["Sample size"]]
elig <- data.frame(eligible = elig, power = NA_real_)
x for (i in seq_along(elig)) {
$power[i] <- suppressMessages(
xpower.TOST(CV = CV, theta0 = theta0,
n = elig[i], design = design))
}print(signif(x, 4), row.names = FALSE)
```

```
# eligible power
# 24 0.8819
# 23 0.8682
# 22 0.8543
# 21 0.8374
# 20 0.8202
```

Not at all. Furthermore, in the full replicate design performed in a single group we have more degrees of freedom than in the 2×2×2 crossover performed in two groups (see the table above).

```
<- 0.335
CV <- CV2mse(CV)
var <- c(44, 40)
n.cross <- c(24, 20)
n.repl <- data.frame(design = rep("2x2x2", 2),
cross groups = rep(2, 2),
n = n.cross,
df = n.cross - 2 - (2 - 1),
CL.lo = NA_real_, CL.hi = NA_real_)
<- data.frame(design = rep("2x2x4", 2),
repl groups = rep(1, 2),
n = n.repl,
df = 3 * n.repl - 4,
CL.lo = NA_real_, CL.hi = NA_real_)
for (i in 1:2) {
5:6] <- exp(log(theta0) - log(1) + c(-1, +1) *
cross[i, qt(1 - 0.05, df = cross$df[i]) *
sqrt(var / 2 * (1 / n.cross[1] + 1 / n.cross[2])))
5:6] <- exp(log(theta0) - log(1) + c(-1, +1) *
repl[i, qt(1 - 0.05, df = repl$df[i]) *
sqrt(var / 4 * (1 / n.repl[1] + 1 / n.repl[2])))
}$width <- cross$CL.hi - cross$CL.lo
cross$width <- repl$CL.hi - repl$CL.lo
repl<- cross$width - repl$width
diffs <- rbind(repl, cross)
x names(x)[2:3] <- c("group(s)", "N")
2] <- sprintf("%.0f ", x[, 2])
x[, 5] <- sprintf("%.3f%%", 100 * x[, 5])
x[, 6] <- sprintf("%.3f%%", 100 * x[, 6])
x[, 7] <- sprintf("%.3f%%", 100 * x[, 7])
x[, print(x, row.names = FALSE)
cat("Maximum difference in the width of CIs:",
sprintf("%.2g%%.\n",100 * max(diffs)))
```

```
# design group(s) N df CL.lo CL.hi width
# 2x2x4 1 24 68 87.492% 103.152% 15.660%
# 2x2x4 1 20 56 87.471% 103.177% 15.707%
# 2x2x2 2 44 41 87.277% 103.406% 16.128%
# 2x2x2 2 40 37 87.259% 103.428% 16.169%
# Maximum difference in the width of CIs: 0.47%.
```

Hence, the CIs in the replicate design will be ≈ 0.5% narrower than in the 2×2×2 crossover.

Say, we have to deal with a narrow therapeutic index drug
(BE-limits 90.00 – 111.11%) with low
variability of 7%. We assume a T/R-ratio of 0.975 and target ≥ 80% power
in a 2×2×2 crossover design. Furthermore, we anticipate a dropout-rate
of 5%. Due to potential AEs close
monitoring of subjects is required and the clinical site is equipped
with eight fully monitored beds. By specifying `equal = TRUE`

the script attempts to generate equally sized groups.

```
<- 0.07
CV <- 0.975
theta0 <- 0.90
theta1 <- 0.80
target <- "2x2x2"
design <- 0.05
do.rate <- 8
capacity <- sim.groups(CV = CV, theta0 = theta0, theta1 = theta1,
x target = target, design = design,
do.rate = do.rate, capacity = capacity,
equal = TRUE)
```

`# Note: Imbalanced sequences in groups (7|7) corrected.`

```
# Attempting to generate equally sized groups.
# CV : 0.0700
# theta0 : 0.9750
# BE-limits : 0.9000 – 1.1111
# Target power : 0.80
# Design : 2x2x2
# Clinical capacity : 8
# ——————————————————————————————————————————————
# Total sample size : 12
# Number of groups : 2
# Subjects per group : 6 | 6
# Degrees of freedom : 10 (pooled model 4)
# 9 (group model 5)
# ——————————————————————————————————————————————
# Anticip. dropout-rate: 5%
# Adjusted sample size : 14
# Number of groups : 2
# Subjects per group : 8 | 6
# Degrees of freedom : 12 (pooled model 4)
# 11 (group model 5)
# ——————————————————————————————————————————————
# Achieved power of the pooled and group models;
# relative change in power of the group model
# compared to the pooled model:
#
# n pooled model group model change
# Expected 12 0.8274 0.8218 -0.677%
# Adjusted 14 0.8849 0.8824 -0.283%
```

Two equally sized groups of six subjects would be possible with the
estimated sample size. That’s what we requested. However, for the
adjusted sample size of 14 subjects, groups of seven subjects would be
imbalanced.^{7} Hence, a note is issued and sequences are
corrected in such a way that both groups are balanced.

As usual, the loss in power by the group model \(\small{(5)}\) compared to the pooled
model \(\small{(4)}\) is less than one percent
and thus, negligble.

We can assess the Type I Error by setting \(\small{\theta_0}\) to one of the
BE limits \(\small{\left\{\theta_1,\theta_2\right\}}\),
*i.e.*, assume that the null hypothesis is true. This can be done
either by simulations with the function `power.TOST.sds()`

or
with the function `power.TOST()`

after adjusting the degrees
of freedom as outlined above. We need at least
10^{6} simulations in order to obtain a stable result.

```
# Cave: long runtime
<- 0.30
CV <- "2x2x2"
design <- 0.80
theta1 <- 1.25
theta2 <- c(22L, 18L)
ngrp <- 2
grps <- sum(ngrp)
n <- 3:2
gmodel <- 1e6
nsims <- c(n, n - (grps - 1))
adj <- data.frame(model = c("pooled", "group"),
x df = c(n - 2, n - 2 - (grps - 1)),
simulated = NA_real_, exact = NA_real_)
for (i in seq_along(gmodel)) {
$simulated[i] <- power.TOST.sds(CV = CV, theta0 = theta2, n = n,
xdesign = design, grps = grps,
ngrp = ngrp, gmodel = gmodel[i],
nsims = nsims, progress = FALSE)
$exact[i] <- suppressMessages(
xpower.TOST(CV = CV, theta0 = theta2,
n = adj[i], design = design))
}cat("CV :", sprintf("%.4f", CV),
"\nBE-limits :",
sprintf("%.4f – %.4f", theta1, theta2),
"\nDesign :", design,
"\nTotal sample size :", n,
"\nNumber of groups :", sprintf("%2.0f", grps),
"\nSubjects per group:", paste(ngrp, collapse = " | "),
"\nNull assessed at :", sprintf("%.4f\n\n", theta2))
print(x, row.names = FALSE, right = FALSE)
```

```
# CV : 0.3000
# BE-limits : 0.8000 – 1.2500
# Design : 2x2x2
# Total sample size : 40
# Number of groups : 2
# Subjects per group: 22 | 18
# Null assessed at : 1.2500
#
# model df simulated exact
# pooled 38 0.049801 0.04999975
# group 37 0.049791 0.04999961
```

Of course, in the pooled model \(\small{(4)}\) the Type I Error is controlled. Due to less degrees of freedom in the group model \(\small{(5)}\) the Type I Error is even lower.

Coming back to the question asked in the Introduction. To repeat:

What is the best approach to deal with the limited capacity of a clinical site?

- If ever possible find a clinical site which can accomodate all subjects at once.
- If that is not possible, consider multiple groups (single site) or multiple sites. The latter might be unavoidable for studies in patients where recruitment is difficult.
- Whether the statistical model has to be modified is still an open
issue.

The Russian authority – and recently*some*European agencies – prefer for multi-group studies**(5)**. For multi-site studies I recommend**(5)**. As shown, the loss in power compared to**(4)**is practically negligble.

The chosen model must be unambigously stated in the protocol. - Due to its smaller sample size a replicate design might be an
alternative avoiding multiple groups.
- It gives a slight incentive in terms of power, since the confidence interval will be narrower than the one of a 2×2×2 crossover due to the larger degrees of freedom.
- If a subject drops out from the study after the second period, the
data can still be used (in most
^{8}designs at least one observation of**T**and**R**is available). - Potential problems (as briefly outlined below and elaborated in another article) can be avoided.
- Even if the study was not planned for scaled average bioequivalence
(ABEL
or RSABE),
the additional information about the within-subject variability of
**R**(and in full replicated designs additionally of**T**) is a convenient side effect. It avoids the problems outlined in another article.

A member of the EMA’s
PKWP once told me
that he would like to see *all* studies performed in a replicate
design – *regardless* whether the drug / drug product is highly
variable or not. From a purely statistical perspective, he is right and
it is one of the rare cases where we were of the same opinion.

Power (and hence, the sample size) depends on the number of
treatments – the smaller sample size in replicate designs is compensated
by more administrations. If the total sample size of a 2×2×2 crossover
design is \(\small{N}\), then the
sample size of a four-period replicate design is \(\small{\tfrac{1}{2}N}\) and of a
three-period replicate design \(\small{\tfrac{3}{4}N}\). Costs of a
replicate design are similar to the common 2×2×2 crossover design (for
details see another
article).

Nevertheless, smaller sample sizes may come with a price. We will save
costs due to less pre-/post-study exams but have to pay a higher subject
remuneration (more hospitalizations and blood draws). However, we have
the same number of samples to analyze and study costs are driven to a
good part by bioanalytics.^{9} Furthermore, one must be aware that more
periods / washout phases increase the chance of dropouts.

**Testing for a significant group- (or site-) by-treatment
interaction should be avoided.**

In a nutshell: Model \(\small{(5)}\) is expanded with an interaction term (\(\small{F\times treatment}\)) to \[\eqalign{response&|\;F,\,\;sequence,\,subject(F\times sequence),\\ &\phantom{|}\;period(F),F\times sequence,\,treatment,\,F\times treatment}\tag{9}\]

- Test for a group- (or site-)
by-treatment interaction at
**α = 0.1**.^{10}- If
**α ≥ 0.1**, evaluate the study by the group model**(5)**. - If
**α < 0.1**, pooling is not allowed. Evaluate one of the groups – generally the largest^{11}– by**(4)**. That leads to a substantial loss in power.^{12} - As
*any*pre-test it will result in an inflated Type I Error (increased patient’s risk).^{13}^{14}^{15}

Not rocket science but basic statistics…

- If
- If certain conditions are fulfilled, the study can be evaluated by
the conventional model
**(4)**.^{11}

Regrettably the WHO
stated also:^{16}

“In those cases where the subjects are recruited and treated in groups, it is appropriate to investigate the statistical significance of the group-by-formulation interaction e.g., with the following ANOVA model: Group, Sequence, Formulation, Period (nested within Group), Group-by-Sequence interaction, Subject (nested within Group*Sequence) and Group-by-Formulation interaction. If this interaction is significant, the study results are not interpretable. However, it is not considered to be correct to report the results of the 90% confidence interval of the ratio test/comparator based on the standard error derived from this ANOVA. If the group-by-formulation interaction is not significant, the 90% confidence interval should be calculated based on the ANOVA model defined in the protocol. This model may or may not include the group effect as pre-defined in the protocol. This depends on whether the group effect is believed to explain the variability observed in the data.

On the contrary: It is __not__ appropriate!

At least we can state in the protocol that we don’t *believe*
[sic]^{17}
in a group-by-treatment interaction. *Believes* don’t belong to
the realm of science – only *assumptions* do. The problems
arising with \(\small{(9)}\) in terms of power and
inflation of the Type I Error is elaborated in another article.

When I gave a presentation^{18} about groups/sites in bioequivalence in
front of European regulatory (‼) statisticians, they were asking in
disbelieve »*What the heck are you talking about?*«

The European Guideline^{19} states:

“The study should be designed in such a way that the formulation effect can be distinguished from other effects.

The precise model to be used for the analysis should be pre-specified in the protocol. The statistical analysis should take into account sources of variation that can be reasonably assumed to have an effect on the response variable.

Is it *reasonable* to assume that groups have an effect on the
PK response (*i.e.*, a
*true* group-by-treatment interaction)?
IMHO, not at all. Recall that
we have the same inclusion- & exclusion-criteria and hence, similar
demographics, the same clinical site, quite often groups separated by
only a couple of days, and analyze samples with the same bioanalytical
method. In a crossover study each subject acts as its own control,
right?

If we follow the approach outlined above, we will face a false positive rate at the level of the test \(\small{\alpha}\) and will not be allowed to use \(\small{(4)}\) or \(\small{(5)}\) of pooled data in ~10% of cases. If that happens, the loss in power will be tremendous. Returning to the example from above:

```
<- "2x2x2"
design <- 0.95
theta0 <- 0.30
CV <- 2
grps <- c(24, 16)
ngrp <- data.frame(model = c("pooled", "groups", "large group"),
res n = c(rep(sum(ngrp), 2), ngrp[1]),
grps = c(rep(grps, 2), 1),
n.grp = c(rep(paste(ngrp, collapse = "|"), 2), ngrp[1]),
power = NA_real_, loss = NA_real_)
$power[1] <- power.TOST(CV = CV, theta0 = theta0,
resn = res$n[1], design = design)
$power[2] <- power.TOST.sds(CV = CV, theta0 = theta0, n = res$n[2],
resdesign = design, grps = grps, ngrp = ngrp,
gmodel = 2, progress = FALSE)
$power[3] <- power.TOST(CV = CV, theta0 = theta0,
resn = ngrp[1], design = design)
$loss[2:3] <- sprintf("%4.1f%%",
res100 * abs((res$power[2:3] - res$power[1]) /
$power[1]))
res$loss[is.na(res$loss)] <- "- "
res3] <- sprintf("%.0f ", res[, 3])
res[, 1] <- c("pooled (4)", "groups (5)", "large group only (4)")
res[$power <- signif(res$power, 4)
resnames(res)[c(2:4)] <- c("N", "group(s)", "n/group")
print(res, row.names = FALSE)
```

```
# model N group(s) n/group power loss
# pooled (4) 40 2 24|16 0.8158 -
# groups (5) 40 2 24|16 0.8120 0.5%
# large group only (4) 24 1 24 0.5577 31.6%
```

Given, that would happen only in ~10% of studies but if it does,
power will be hardly larger than tossing a coin.

A meta-study^{20} showed that the
*overall* loss in power in an evaluation by \(\small{(5)}\)
instead of by \(\small{(4)}\) can reach ~7%.

Hundreds (or thousands‽) of studies have been performed in multiple groups, evaluated by \(\small{(4)}\) and were accepted by European agencies in the twinkling of an eye.

Helmut Schütz 2024

`R`

and `PowerTOST`

GPL 3.0,
`pandoc`

GPL 2.0.

1^{st} version June 28, 2022. Rendered July 2, 2024 10:59 CEST
by rmarkdown
via pandoc in 0.28 seconds.

Labes D, Schütz H, Lang B.

*PowerTOST: Power and Sample Size for (Bio)Equivalence Studies.*Package version 1.5.6. 2024-03-18. CRAN.↩︎FDA, CDER.

*Guidance for Industry. Statistical Approaches to Establishing Bioequivalence.*Rockville. January 2001. Download.↩︎FDA, CDER.

*Draft Guidance for Industry. Statistical Approaches to Establishing Bioequivalence.*Silver Spring. December 2022. Revision 1. Download.↩︎ICH.

*Bioequivalence for Immediate Release Solid Oral Dosage Forms.*M13A. Draft version. 20 December 2022. Online.↩︎Statisticians may be more familiar with the terminology \(\small{\textrm{AB}|\textrm{BA}}\).

Forgive me – in bioequivalence \(\small{\textrm{TR}|\textrm{RT}}\) is commonly used, where \(\small{\textrm{T}}\) denotes the Test and \(\small{\textrm{R}}\) the Reference treatment (formulation).↩︎If Subject 1 is randomized to sequence \(\small{\text{TR}}\), there is not ‘another’ Subject 1 randomized to sequence \(\small{\text{RT}}\). Randomization is not like Schrödinger’s cat. Hence, the nested term in the guidelines is an insult to the mind.↩︎

Not the same number of subjects in sequence \(\small{\textrm{TR}}\) than in sequence \(\small{\textrm{RT}}\).↩︎

Of course, not in the \(\small{\textrm{TTRR|RRTR}}\) full replicate design. However, it is rarely employed in practice.↩︎

In case of a ‘poor’ bioanalytical method requiring a large sample volume: Since the total blood sampling volume is generally limited with the one of a blood donation, one may opt for a three-period full replicate or has to measure HCT prior to administration in higher periods and – for safety reasons – exclude subjects if their HCT is too high.↩︎

Since this is a

*between-subject*comparison, it has low power. IMHO, the 10% level is arbitrary (I failed to find any justification).↩︎What if more than one group have the same size? Has BE to be demonstrated for all of them? Quite possible, since that would be the most conservative approach. However, I failed to find a reference. That’s one of the reasons I prefer to have at least one group in the maximum capacity of the site instead of equally sized groups.↩︎

If you have – really! – nothing better to do (long runtime): Try the examples with the additional argument

`gmodel = 1`

in the function. In the first example 5.3–5.7% power will be lost and in the second 3.6–6.0%.↩︎Schütz H.

*Multi-Group Studies in Bioequivalence. To pool or not to pool?*Presentation at: BioBridges 2018. Prague. 26–27 September 2018. Online.↩︎For the examples I got in 10

^{6}simulations (extremely long runtime) an inflated Type I Error of ≈ 0.063.↩︎In a meeting in April 2021 the FDA accepted my concerns about an inflated Tpye I Error and agreed that in a planned multi-center study model \(\small{(5)}\)

*without*a pre-test should be used.↩︎WHO.

*Frequent deficiencies in BE study protocols.*Geneva. November 2020. Online.↩︎Quoting my late father: »

*If you believe, go to church.*«↩︎Schütz H.

*Multi-Group and Multi-Site Studies. To pool or not to pool?*2^{nd}Annual Biosimilars Forum. Satellite Short Course. Budapest. 5 October 2017. Online.↩︎EMA, CHMP.

*Guideline on the Investigation of Bioequivalence.*London. 20 January 2010. Online.↩︎Schütz H, Burger DA, Cobo E, Dubins D, Farkas T, Labes D, Lang B, Ocaña J, Ring A, Shitova A, Stus V, Tomashevskiy M.

*Group-by-Treatment Interaction Effects in Comparative Bioavailability Studies.*AAPS J. 2024; 26(3): 50. doi:10.1208/s12248-024-00921-x. Open Access. Supplementary Material.↩︎