
Examples in this article were generated with R 4.3.3 by the package `PowerTOST`.^{1} See also its online manual^{2} for details.

- The right-hand badges give the respective section’s ‘level’.

- Basics about sample size methodology – requiring no or only limited statistical expertise.

- These sections are the most important ones. They are – hopefully – easily comprehensible even for novices.

- A somewhat higher knowledge of statistics and/or R is required. May be skipped or reserved for a later reading.


Abbreviation | Meaning |
---|---|
\(\small{\alpha}\) | Nominal level of the test, probability of Type I Error, patient’s risk |
\(\small{\beta}\) | Probability of Type II Error, producer’s risk |
(A)BE | (Average) Bioequivalence |
ABEL | Average Bioequivalence with Expanding Limits |
CI, CL | Confidence Interval, Limit |
\(\small{CV}\) | Coefficient of Variation |
\(\small{CV_\textrm{inter}}\) | Between-subject Coefficient of Variation |
\(\small{CV_\textrm{intra}}\) | Within-subject Coefficient of Variation |
\(\small{CV_\textrm{wR}}\) | Within-subject Coefficient of Variation of the Reference treatment |
\(\small{\delta}\) | Margin of clinical relevance in Non-Inferiority/Superiority and Non-Superiority |
\(\small{H_0}\) | Null hypothesis |
\(\small{H_1}\) | Alternative hypothesis (also \(\small{H_\textrm{a}}\)) |
L, U | Lower and upper limits in ABE(L) |
\(\small{\mu_\text{T},\,\mu_\text{R}}\) | True mean of the Test and Reference treatment, respectively |
\(\small{\pi}\) | Prospective power (\(\small{1-\beta}\)) |
TOST | Two One-Sided Tests |

What are the main statistical issues in planning a confirmatory experiment?

For details about inferential statistics and hypotheses see another article.

An ‘optimal’ study design is one that – taking all assumptions into account – has a reasonably high chance of demonstrating non-inferiority or non-superiority (power) whilst controlling the patient’s risk.

Contrary to Bioequivalence (BE), where a study is assessed with \(\small{\alpha=0.05}\) by the
TOST-procedure (or more
commonly by the \(\small{100\,(1-2\;\alpha)}\) Confidence
Interval inclusion approach), in Non-Inferiority/Superiority and
Non-Superiority the respective *one-sided* test with \(\small{\alpha=0.025}\) is employed.

Based on a ‘clinically relevant margin’ \(\small{\delta}\) we have different hypotheses.

We assume that __higher__ responses are *better*.^{3} ^{4} If data follow a lognormal distribution the hypotheses are \[H_0:\log_{e}\frac{\mu_\text{T}}{\mu_\text{R}}\leq
\log_{e}\delta\;vs\;H_1:\log_{e}\frac{\mu_\text{T}}{\mu_\text{R}}>\log_{e}\delta\tag{1a}\]

If data follow a normal distribution the hypotheses are \[H_0:\mu_\text{T}-\mu_\text{R}\leq \delta\;vs\;H_1:\mu_\text{T}-\mu_\text{R}>\delta\tag{1b}\]

Applications:

- Clinical phase III trials comparing a new treatment with placebo or an established treatment (efficacy).
- Comparing minimum concentrations (*C*_{min}) of a new Modified Release (MR) formulation with the ones of an approved Immediate Release (IR) formulation as a surrogate of efficacy.^{5}

We assume that __lower__ responses are *better*. If data follow a lognormal distribution the hypotheses are
\[H_0:\log_{e}\frac{\mu_\text{T}}{\mu_\text{R}}\geq
\log_{e}\delta\;vs\;H_1:\log_{e}\frac{\mu_\text{T}}{\mu_\text{R}}<\log_{e}\delta\tag{2a}\]

If data follow a normal distribution the hypotheses are \[H_0:\mu_\text{T}-\mu_\text{R}\geq \delta\;vs\;H_1:\mu_\text{T}-\mu_\text{R}<\delta\tag{2b}\]

Applications:

- Clinical phase III trials comparing Adverse Effects of a new treatment with placebo or an established treatment (safety).
- Comparing maximum concentrations (*C*_{max}) of a new MR formulation with the ones of an approved IR formulation as a surrogate of safety.^{5}

A *basic* knowledge of R is required. To run the scripts at least version 1.4.9 (2019-12-19) of `PowerTOST` is suggested. Any version of R would likely do, though the current release of `PowerTOST` was only tested with R version 4.2.3 (2023-03-15) and later.

All scripts were run on a Xeon E3-1245v3 @ 3.40GHz (1/4 cores) 16GB RAM
with R 4.3.3 on Windows 7 build 7601, Service
Pack 1, Universal C Runtime 10.0.10240.16390.

Note that in the functions `sampleN.noninf()` and `power.noninf()` the assumed coefficient of variation `CV` has to be given as a ratio and not in percent. If the analysis is based on lognormal data by \(\small{(1\text{a})}\) or \(\small{(2\text{a})}\), the assumed `theta0` and margin \(\small{\delta}\) (`margin`) have to be given as ratios and not in percent. If the analysis is based on normal data by \(\small{(1\text{b})}\) or \(\small{(2\text{b})}\), `theta0` and `margin` have to be given with the original value. Data have to be continuous on a ratio scale, either lognormal \(\small{\left(x\in\mathbb{R}^{+}=\{0<x<+\infty\}\right)}\) or normal \(\small{\left(x\in\mathbb{R}=\{-\infty<x<+\infty\}\right)}\) distributed.

Count data (*e.g.*, events), rates (0 – 1) and percentages, as
well as ordinal data (*e.g.*, *t*_{max}) are not
supported.

`sampleN.noninf()` gives balanced sequences for crossover designs (*i.e.*, the same number of subjects is allocated to all sequences) or equal group sizes in a parallel design. Furthermore, the estimated sample size is the *total* number of subjects – unlike in some other software packages, which report subjects per sequence or treatment arm. The sample size functions of `PowerTOST` use a modification of Zhang’s method^{6} based on the large sample approximation as the starting value of the iterations.

Most examples deal with studies where the response variables follow a lognormal distribution, *i.e.*, we assume a multiplicative model (ratios instead of differences). We work with \(\small{\log_{e}}\)-transformed data in order to allow analysis by the *t*-test (requiring differences). This is the default in most functions of `PowerTOST` and hence, the argument `logscale = TRUE` does not need to be specified.

In software providing only a two-sided \(\small{100(1-2\,\alpha)}\) confidence
interval for *equivalence* (*e.g.*, Phoenix WinNonlin,^{7} PKanalix^{8}): Use only
the *lower* (for Non-Inferiority/Superiority) or *upper*
(for Non-Superiority) confidence limit (see this article, Fig 3 for
an example), which is one-sided \(\small{100(1-\alpha)}\).

It may sound picky but ‘sample size __calculation__’ (as used in
most guidelines and alas, in some publications and textbooks) is sloppy
terminology. In order to get prospective power (and hence, a sample
size), we need five values:

- The level of the test \(\small{\alpha}\) (in Non-Superiority / Non-Inferiority commonly 0.025),
- the clinically relevant margin \(\small{\delta}\),
- the desired (or target) power \(\small{\pi}\),
- the variance (commonly expressed as a coefficient of variation), and
- the deviation of the test from the reference treatment.

1 – 2 are __fixed__ by the agency,

3 is __set__ by the sponsor, and

4 – 5 are just (uncertain!) __assumptions__.

In other words, obtaining a sample size is *not* an
*exact* calculation like \(\small{2\times2=4}\) but always just an
__estimation__.
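These five values can be handed to `power.noninf()` directly; a minimal sketch with assumed figures (the same ones as in the first worked example of this article):

```r
library(PowerTOST)
# The five values in action (assumed figures for illustration)
power.noninf(alpha  = 0.025, # 1. level of the test
             margin = 0.80,  # 2. clinically relevant margin
             CV     = 0.25,  # 4. assumed coefficient of variation
             theta0 = 0.95,  # 5. assumed deviation of T from R
             n      = 36,    # a given total sample size
             design = "2x2") # ~0.82: the target power (3.) is achieved
```

Swap any of the assumptions (4.–5.) and the resulting power – and hence the required sample size – changes.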

“Power Calculation – A guess masquerading as mathematics.”

Realization: Observations (in a sample) of a random variable (of the
population).

Of note, it is extremely unlikely that all assumptions will be
exactly realized in a particular study. Hence, calculating retrospective
(a.k.a. *post hoc*, *a
posteriori*) power is not only futile but plain nonsense.^{10}

Since the within-subject variability is generally lower than the between-subject variability, crossover studies are popular. The relative efficiency of a crossover compared to a parallel study is given by \(\small{\frac{\sigma_\textrm{intra}^2\,+\,\sigma_\textrm{inter}^2}{0.5\,\times\,\sigma_\textrm{intra}^2}}\). If, say, \(\small{\sigma_\textrm{intra}^2=0.5\times\sigma_\textrm{inter}^2}\), in a parallel study we need six times as many subjects as in a crossover to obtain the same power. On the other hand, in a crossover we have two measurements per subject, which still makes the parallel study approximately three times more costly.
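The arithmetic behind the six-fold figure, as a quick sanity check (pure base R; the variances are arbitrary, only their ratio matters):

```r
# Relative sample size of a parallel vs a 2x2 crossover design
# according to the efficiency formula above
efficiency <- function(s2.intra, s2.inter) {
  (s2.intra + s2.inter) / (0.5 * s2.intra)
}
s2.intra <- 0.04         # an arbitrary within-subject variance
s2.inter <- 2 * s2.intra # between-subject variance twice as large
efficiency(s2.intra, s2.inter) # 6: six times as many subjects needed
```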

Note that there is *no* relationship
between \(\small{CV_\textrm{intra}}\)
and \(\small{CV_\textrm{inter}}\). An
example are drugs which are subjected to polymorphic metabolism. For
these drugs \(\small{CV_\textrm{intra}\ll
CV_\textrm{inter}}\). On the other hand, some
HVD(P)s show
\(\small{CV_\textrm{intra}>CV_\textrm{inter}}\).

Carryover: A residual effect of a previous period.

It is a prerequisite that no – unequal – carryover from one period to the next exists. Only then is the comparison of treatments unbiased. For details see another article.^{11} Subjects have to be in the same physiological state^{12} throughout the study – guaranteed by a sufficiently long washout phase. Crossover studies can be performed not only in healthy volunteers but also in patients with a *stable* disease (*e.g.*, asthma).

In patients with an *unstable* disease (*e.g.*, in oncology) or if adverse effects are unacceptable in healthy volunteers, studies __must__ be performed in a parallel design. If crossovers are not feasible (*e.g.*, for drugs with a very long half-life), studies could be performed in a parallel design as well.

The sample size cannot be estimated *directly* – only power can be calculated for an already *given* sample size. The power equations cannot be re-arranged to solve for the sample size.
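Schematically, a sample-size search therefore just steps the sample size until the target power is reached – a brute-force sketch with the defaults of `power.noninf()` (not the actual algorithm of `sampleN.noninf()`, which starts from Zhang’s large-sample approximation instead of a fixed lower bound):

```r
library(PowerTOST)
# Brute-force search: increase n until the target power is reached
# (defaults: alpha 0.025, margin 0.8, theta0 0.95, logscale = TRUE)
CV     <- 0.25
target <- 0.80
n      <- 12                    # start at the regulatory minimum
repeat {
  pwr <- power.noninf(CV = CV, n = n, design = "2x2")
  if (pwr >= target) break
  n <- n + 2                    # keep the two sequences balanced
}
c(n = n, power = round(pwr, 6)) # n = 36, as sampleN.noninf() finds
```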

“Power. That which statisticians are always calculating but never have.”

`library(PowerTOST) # attach it to run the examples`

Throughout the examples I am considering studies in a *single*
center – not multiple groups *within* it or multicenter studies.
That’s another cup of tea (for ‘problems’ in
BE see another article).

Most methods of `PowerTOST` are based on pairwise comparisons. It is up to you to adjust the level of the test `alpha` if you want to compare more (*e.g.*, two test treatments *vs* a reference) in order to avoid inflation of the family-wise error rate due to multiplicity.
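For instance, with two test treatments compared against one reference, a simple (conservative) Bonferroni adjustment halves `alpha` – shown here only as an illustration; other multiplicity adjustments exist:

```r
library(PowerTOST)
# Two comparisons against the same reference: Bonferroni adjustment,
# each test at alpha / 2 (an illustrative assumption, not a recommendation)
sampleN.noninf(alpha = 0.025 / 2, CV = 0.25, details = FALSE)
```

Naturally, the smaller adjusted `alpha` leads to a larger sample size than the unadjusted single comparison.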

Say, we want to demonstrate Non-Inferiority in a 2×2×2 crossover design, assume a *CV* of 25%, a T/R-ratio of 0.95, a \(\small{\delta}\) of 0.8, and target a power of at least 0.80.

Since `alpha = 0.025`, `theta0 = 0.95`, `margin = 0.8`, `targetpower = 0.8`, `design = "2x2"`, and `logscale = TRUE` are defaults of the function, we don’t have to give them explicitly.

`sampleN.noninf(CV = 0.25)`

```
#
# ++++++++++++ Non-inferiority test +++++++++++++
# Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover
# log-transformed data (multiplicative model)
#
# alpha = 0.025, target power = 0.8
# Non-inf. margin = 0.8
# True ratio = 0.95, CV = 0.25
#
# Sample size (total)
# n power
# 36 0.820330
```

If you want to perform the analysis with untransformed data, specify `logscale = FALSE`. Then the defaults are `theta0 = -0.05` and `margin = -0.2`.

`sampleN.noninf(CV = 0.25, logscale = FALSE)`

```
#
# ++++++++++++ Non-inferiority test +++++++++++++
# Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover
# untransformed data (additive model)
#
# alpha = 0.025, target power = 0.8
# Non-inf. margin = -0.2
# True diff. = -0.05, CV = 0.25
#
# Sample size (total)
# n power
# 46 0.803507
```

Let’s return to lognormally distributed data because that’s more common.

Say, you have information from a pilot study that the treatment performs *really* (*i.e.*, 30%) better than placebo. You are cautious (good idea!) and assume a *lower* T/R-ratio and a *higher* *CV* than the observed 1.30 and 0.25.

`sampleN.noninf(CV = 0.28, theta0 = 1.25, margin = 1)`

```
#
# ++++++++++++ Non-inferiority test +++++++++++++
# Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover
# log-transformed data (multiplicative model)
#
# alpha = 0.025, target power = 0.8
# Non-inf. margin = 1
# True ratio = 1.25, CV = 0.28
#
# Sample size (total)
# n power
# 26 0.802234
```

What about a parallel design? Likely the *CV* will be
substantially higher.^{14}

`sampleN.noninf(CV = 0.50, theta0 = 1.25, margin = 1, design = "parallel")`

```
#
# ++++++++++++ Non-inferiority test +++++++++++++
# Sample size estimation
# -----------------------------------------------
# Study design: 2 parallel groups
# log-transformed data (multiplicative model)
#
# alpha = 0.025, target power = 0.8
# Non-inf. margin = 1
# True ratio = 1.25, CV = 0.5
#
# Sample size (total)
# n power
# 144 0.803753
```

I hear the ‘Guy in the Armani suit’^{15} shouting »*C’mon,
72 subjects / arm, who shall pay for that? Hey, we have the wonder-drug!
It works twice as good as snake oil!*«

`sampleN.noninf(CV = 0.50, theta0 = 2, margin = 1, design = "parallel")`

```
#
# ++++++++++++ Non-inferiority test +++++++++++++
# Sample size estimation
# -----------------------------------------------
# Study design: 2 parallel groups
# log-transformed data (multiplicative model)
#
# alpha = 0.025, target power = 0.8
# Non-inf. margin = 1
# True ratio = 2, CV = 0.5
#
# Sample size (total)
# n power
# 18 0.831844
```

Cross fingers that the drug performs *really* that great. If
it is actually just 60% better than snake oil, power with this sample
size will be only ≈51%. Master of disaster…
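That claim can be checked directly (same parallel design and 18 subjects, but a T/R-ratio of only 1.6 instead of 2):

```r
library(PowerTOST)
# Sensitivity check: wonder-drug only 60% better than snake oil
power.noninf(CV = 0.50, theta0 = 1.6, margin = 1,
             n = 18, design = "parallel") # ~0.51
```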

Possibly the ‘Guy in the Armani suit’ has read about ‘allocation ratios’ in the COVID-19 vaccination trials and asks »*Why should we treat as many patients with snake oil as with our wonder-drug?*«

Let’s see.

```
round.up <- function(n, alloc) {
  return(as.integer(alloc * (n %/% alloc + as.logical(n %% alloc))))
}
CV      <- 0.50 # Total (pooled) CV
theta0  <- 2    # Assumed T/R-ratio
margin  <- 1    # Non-Inferiority margin
target  <- 0.8  # Target (desired) power
alloc.T <- 3    # Allocation of wonder-drug (T)
alloc.R <- 1    # Allocation of snake oil (R)
# conventional 1:1
tmp   <- sampleN.noninf(CV = CV, theta0 = theta0, margin = margin,
                        design = "parallel", targetpower = target,
                        print = FALSE)
n.0   <- as.integer(tmp[["Sample size"]])
pwr.0 <- tmp[["Achieved power"]]
# 3:1 allocation (naïve)
n.1   <- setNames(c(round.up(n.0 / (alloc.T + alloc.R) * alloc.T, alloc.T),
                    round.up(n.0 / (alloc.T + alloc.R) * alloc.R, alloc.R)),
                  c("Test", "Reference"))
pwr.1 <- power.noninf(CV = CV, theta0 = theta0, margin = margin,
                      n = n.1, design = "parallel")
# 3:1 allocation (preserving power)
n.2 <- n.1
repeat {# increase the sample size if necessary
  pwr.2 <- power.noninf(CV = CV, theta0 = theta0, margin = margin,
                        n = n.2, design = "parallel")
  if (pwr.2 >= target) break
  n.2[["Test"]]      <- as.integer(n.2[["Test"]] + alloc.T)
  n.2[["Reference"]] <- as.integer(n.2[["Reference"]] + alloc.R)
}
fmt <- paste0("%", nchar(as.character(n.0)), ".0f")
cat("\n++++++++++++ Non-inferiority test +++++++++++++",
    "\n            Sample size estimation",
    "\n-----------------------------------------------",
    "\nStudy design: 2 parallel groups",
    "\nlog-transformed data (multiplicative model)",
    "\n\nalpha = 0.025, target power =", target,
    "\nNon-inf. margin =", margin,
    paste0("\nTrue ratio = ", theta0, ", CV = ", CV),
    "\n\nTotal sample size =", n.0, "(1:1 allocation)",
    paste0("\n  n (T) = ", sprintf(fmt, n.0 / 2),
           ", n (R) = ", sprintf(fmt, n.0 / 2),
           ": power = ", signif(pwr.0, 6)),
    "\nTotal sample size =", sum(n.1),
    "(naïve", paste0(alloc.T, ":", alloc.R, " allocation)"),
    "penalty", sprintf("%.0f%%", 100 * (sum(n.1) / n.0 - 1)),
    paste0("\n  n (T) = ", sprintf(fmt, n.1[["Test"]]),
           ", n (R) = ", sprintf(fmt, n.1[["Reference"]]),
           ": power = ", signif(pwr.1, 6)),
    "change", sprintf("%+.2f%%", 100 * (pwr.1 - pwr.0) / pwr.0),
    "\nTotal sample size =", sum(n.2),
    paste0("(", alloc.T, ":", alloc.R, " allocation)"),
    sprintf("%13s %.0f%%", "penalty", 100 * (sum(n.2) / n.0 - 1)),
    paste0("\n  n (T) = ", sprintf(fmt, n.2[["Test"]]),
           ", n (R) = ", sprintf(fmt, n.2[["Reference"]]),
           ": power = ", signif(pwr.2, 6)),
    "change", sprintf("%+.2f%%", 100 * (pwr.2 - pwr.0) / pwr.0), "\n")
```

```
#
# ++++++++++++ Non-inferiority test +++++++++++++
# Sample size estimation
# -----------------------------------------------
# Study design: 2 parallel groups
# log-transformed data (multiplicative model)
#
# alpha = 0.025, target power = 0.8
# Non-inf. margin = 1
# True ratio = 2, CV = 0.5
#
# Total sample size = 18 (1:1 allocation)
# n (T) = 9, n (R) = 9: power = 0.831844
# Total sample size = 20 (naïve 3:1 allocation) penalty 11%
# n (T) = 15, n (R) = 5: power = 0.766496 change -7.86%
# Total sample size = 24 (3:1 allocation) penalty 33%
# n (T) = 18, n (R) = 6: power = 0.844798 change +1.56%
```

Already in the naïve 3:1 allocation you have to round the sample size up because the 18 of the 1:1 allocation is not a multiple of 4. Nevertheless, you lose 7.86% power. In order to preserve power, you have to increase the sample size further.

However, it’s still based on a strong *belief* in the performance
of the wonder-drug. If it again turns out to be just 60% better than
snake oil, power with 24 subjects in the 3:1 allocation will be only
≈52%. Hardly better than tossing a coin.

Compare a new MR
formulation (regimen once a day) with an
IR formulation (twice a day).
*C*_{max} is the surrogate metric for *safety*
(Non-Superiority) and *C*_{min} is the surrogate metric
for *efficacy* (Non-Inferiority):

“[…] therapeutic studies might be waived [if …]:

- there is a well-defined therapeutic window in terms of safety and efficacy, the rate of input is known not to influence the safety and efficacy profile or the risk for tolerance development and

- bioequivalence between the reference and the test product is shown in terms of *AUC*_{(0–τ),ss} and
- *C*_{max,ss} for the new MR formulation is below or equivalent to the *C*_{max,ss} for the approved formulation and *C*_{min,ss} for the MR formulation is above or equivalent to the *C*_{min,ss} of the approved formulation.

Although not explicitly stated in the guideline, AFAIK the EMA expects tests at \(\small{\alpha=0.05}\).

Margins are 1.25 for *C*_{max} and 0.80 for
*C*_{min}. We assume *CV*s of 0.15 for
*AUC*, 0.20 for *C*_{max}, 0.35 for
*C*_{min}, T/R-ratios of 0.95 for *AUC* and
*C*_{min} and 1.05 for *C*_{max}. We plan
the study in a 2-treatment, 2-sequence, 4-period full replicate design due to the high
variability of *C*_{min}.

Which PK metric leads the sample
size in such a Bioequivalence (*AUC*) / Non-Superiority
(*C*_{max}) / Non-Inferiority (*C*_{min})
study?

```
design <- "2x2x4"
x      <- data.frame(design = "2x2x4", metric = c("AUC", "Cmax", "Cmin"),
                     margin = c(NA, 1.25, 0.80), CV = c(0.15, 0.20, 0.35),
                     theta0 = c(0.95, 1.05, 0.95), n = NA_integer_,
                     power = NA_real_, stringsAsFactors = FALSE)
for (i in 1:nrow(x)) {
  if (x$metric[i] == "AUC") {# ABE
    x[i, 6:7] <- sampleN.TOST(design = design,
                              theta0 = x$theta0[i],
                              CV = x$CV[i],
                              details = FALSE,
                              print = FALSE)[7:8]
    if (x$n[i] < 12) {# minimum acc. to GLs
      x$n[i]     <- 12
      x$power[i] <- power.TOST(design = design,
                               theta0 = x$theta0[i],
                               CV = x$CV[i],
                               n = x$n[i])
    }
  } else { # Non-Inferiority, Non-Superiority
    x[i, 6:7] <- sampleN.noninf(design = design,
                                alpha = 0.05,
                                margin = x$margin[i],
                                theta0 = x$theta0[i],
                                CV = x$CV[i],
                                details = FALSE,
                                print = FALSE)[6:7]
    if (x$n[i] < 12) {# minimum acc. to GLs
      x$n[i]     <- 12
      x$power[i] <- power.noninf(design = design,
                                 alpha = 0.05,
                                 margin = x$margin[i],
                                 theta0 = x$theta0[i],
                                 CV = x$CV[i],
                                 n = x$n[i])
    }
  }
}
x$power  <- signif(x$power, 4) # cosmetics
x$margin <- sprintf("%.2f", x$margin)
x$margin[x$margin == "NA"] <- "– "
print(x, row.names = FALSE)
cat(paste0("Sample size lead by ", x$metric[x$n == max(x$n)], ".\n"))
```

```
# design metric margin CV theta0 n power
# 2x2x4 AUC – 0.15 0.95 12 0.9881
# 2x2x4 Cmax 1.25 0.20 1.05 12 0.9098
# 2x2x4 Cmin 0.80 0.35 0.95 26 0.8184
# Sample size lead by Cmin.
```

However, with 26 subjects to show Non-Inferiority of
*C*_{min} the study is ‘overpowered’ (see this article) for
BE of *AUC* and
Non-Superiority of *C*_{max}:

```
cat("Power with", max(x$n), "subjects for",
    "\nAUC :",
    power.TOST(design = design, CV = x$CV[1],
               theta0 = x$theta0[1], n = max(x$n)),
    "\nCmax:",
    power.noninf(design = design, alpha = 0.05, margin = 1.25,
                 CV = x$CV[2], theta0 = x$theta0[2], n = max(x$n)), "\n")
```

```
# Power with 26 subjects for
# AUC : 0.9999851
# Cmax: 0.9974663
```

That gives us some space to navigate for *e.g.*,
*C*_{max} if values turn out to be ‘worse’ (say,
*CV* 0.20 → 0.25, T/R-ratio 1.05 → 1.10):

```
power.noninf(design = design, alpha = 0.05, margin = 1.25,
             CV = 0.25, theta0 = 1.10, n = max(x$n)) # higher CV, worse theta0
```

`# [1] 0.8359967`

The bracketing approach __may__ require a lower sample size than
required for demonstrating BE with
the common CI-inclusion
approach for all PK metrics, which
is another option mentioned in the guideline.^{3} Note that reference-scaling by
ABEL is
acceptable for *C*_{max}^{5}
^{16} and
*C*_{min}^{5} if their
*CV*_{wR} >30%, expanding the limits can be justified
based on clinical grounds, and *CV*_{wR} > 30% is not
caused by ‘outliers’. How does that compare?

```
y <- data.frame(design = design, method = "ABE",
                metric = c("AUC", "Cmax", "Cmin"),
                CV = c(0.15, 0.20, 0.35),
                theta0 = c(0.95, 1.05, 0.90),
                L = 0.8, U = 1.25, n = NA_integer_,
                power = NA_real_, stringsAsFactors = FALSE)
for (i in 1:nrow(y)) {
  if (y$metric[i] == "AUC" | y$CV[i] <= 0.3) {
    y[i, 8:9] <- sampleN.TOST(CV = y$CV[i], theta0 = y$theta0[i],
                              design = design, print = FALSE,
                              details = FALSE)[7:8]
    if (y$n[i] < 12) {# minimum acc. to the GL
      y$n[i]     <- 12
      y$power[i] <- power.TOST(CV = y$CV[i], theta0 = y$theta0[i],
                               design = design, n = y$n[i])
    }
  } else {
    y$method[i] <- "ABEL"
    y[i, 6:7]   <- scABEL(CV = y$CV[i])
    y[i, 8:9]   <- sampleN.scABEL(CV = y$CV[i], theta0 = y$theta0[i],
                                  design = design, print = FALSE,
                                  details = FALSE)[8:9]
  }
}
y$L     <- sprintf("%.2f%%", 100 * y$L) # cosmetics
y$U     <- sprintf("%.2f%%", 100 * y$U)
y$power <- signif(y$power, 4)
names(y)[6:7] <- c("L ", "U ")
print(y, row.names = FALSE)
```

```
# design method metric CV theta0 L U n power
# 2x2x4 ABE AUC 0.15 0.95 80.00% 125.00% 12 0.9881
# 2x2x4 ABE Cmax 0.20 1.05 80.00% 125.00% 12 0.9085
# 2x2x4 ABEL Cmin 0.35 0.90 77.23% 129.48% 34 0.8118
```

Which approach is optimal is a case-to-case decision. Although in
this example bracketing is the ‘winner’ (26 subjects instead of 34), it
might be problematic if a *CV* is larger and/or a T/R-ratio worse
than assumed: *CV* of *AUC* 0.15 → 0.20,
*C*_{max} 0.20 → 0.25, *C*_{min} 0.35 →
0.50; T/R-ratio of *AUC* 0.95 → 0.90, *C*_{max}
1.05 → 1.12, *C*_{min} 0.90 → 0.88.

```
n <- max(y$n)
z <- data.frame(approach = c("ABE", "Non-Superiority", "ABE",
                             "Non-Inferiority", "ABE"),
                metric = c("AUC", rep(c("Cmax", "Cmin"), each = 2)),
                CV = c(0.2, rep(c(0.25, 0.50), each = 2)),
                theta0 = c(0.90, rep(c(1.12, 0.88), each = 2)),
                margin = c(NA, 1.25, NA, 0.80, NA),
                L = c(0.80, NA, 0.80, NA, 0.80),
                U = c(1.25, NA, 1.25, NA, 1.25),
                n = n, power = NA_real_,
                stringsAsFactors = FALSE)
for (i in 1:nrow(z)) {
  if (z$approach[i] %in% c("Non-Superiority", "Non-Inferiority")) {
    z$power[i] <- power.noninf(design = design,
                               alpha = 0.05,
                               margin = z$margin[i],
                               theta0 = z$theta0[i],
                               CV = z$CV[i],
                               n = z$n[i])
  } else {
    if (z$CV[i] <= 0.3) {
      z$power[i] <- power.TOST(design = design,
                               theta0 = z$theta0[i],
                               CV = z$CV[i],
                               n = z$n[i])
    } else {
      z$approach[i] <- "ABEL"
      z[i, 6:7]  <- scABEL(CV = z$CV[i])
      z$power[i] <- power.scABEL(design = design,
                                 theta0 = z$theta0[i],
                                 CV = z$CV[i],
                                 n = z$n[i])
    }
  }
}
z$L      <- sprintf("%.2f%%", 100 * z$L) # cosmetics
z$U      <- sprintf("%.2f%%", 100 * z$U)
z$power  <- signif(z$power, 4)
z$margin <- sprintf("%.2f", z$margin)
z$margin[z$margin == "NA"] <- "– "
z$L[z$L == "NA%"] <- "– "
z$U[z$U == "NA%"] <- "– "
names(z)[6:7] <- c("L ", "U ")
print(z, row.names = FALSE)
```

```
# approach metric CV theta0 margin L U n power
# ABE AUC 0.20 0.90 – 80.00% 125.00% 34 0.9640
# Non-Superiority Cmax 0.25 1.12 1.25 – – 34 0.8258
# ABE Cmax 0.25 1.12 – 80.00% 125.00% 34 0.8258
# Non-Inferiority Cmin 0.50 0.88 0.80 – – 34 0.3169
# ABEL Cmin 0.50 0.88 – 69.84% 143.19% 34 0.8183
```

**Non-Superiority / Non-Inferiority**

We will pass *C*_{max} (note that its power equals the one of ABE) but fail^{17} *C*_{min}.

**ABEL / ABE**

Although we yet have to assess *C*_{max} by ABE (*CV*_{wR} < 30%), it is not ‘overpowered’ any more.

In reference-scaling by ABEL *C*_{min} will still pass due to the wider expansion of the limits (69.84% – 143.19% for *CV*_{wR} 50% instead of 77.23% – 129.48% for *CV*_{wR} 35%).

Hence, in this case the equivalence approach by ABE(L) is the ‘winner’ because it tolerates larger deviations from the assumptions.

What happens if you fail to convince the agency that ABEL is acceptable? The picture changes.

```
a <- data.frame(design = design, method = "ABE",
                metric = c("AUC", "Cmax", "Cmin"),
                CV = c(0.15, 0.20, 0.35),
                theta0 = c(0.95, 1.05, 0.90),
                L = 0.8, U = 1.25, n = NA_integer_,
                power = NA_real_, stringsAsFactors = FALSE)
for (i in 1:nrow(a)) {
  a[i, 8:9] <- sampleN.TOST(CV = a$CV[i], theta0 = a$theta0[i],
                            design = design, print = FALSE,
                            details = FALSE)[7:8]
  if (a$n[i] < 12) {# minimum acc. to the GL
    a$n[i]     <- 12
    a$power[i] <- power.TOST(CV = a$CV[i], theta0 = a$theta0[i],
                             design = design, n = a$n[i])
  }
}
a$L     <- sprintf("%.2f%%", 100 * a$L) # cosmetics
a$U     <- sprintf("%.2f%%", 100 * a$U)
a$power <- signif(a$power, 4)
names(a)[6:7] <- c("L ", "U ")
print(a, row.names = FALSE)
```

```
# design method metric CV theta0 L U n power
# 2x2x4 ABE AUC 0.15 0.95 80.00% 125.00% 12 0.9881
# 2x2x4 ABE Cmax 0.20 1.05 80.00% 125.00% 12 0.9085
# 2x2x4 ABE Cmin 0.35 0.90 80.00% 125.00% 52 0.8003
```

Nasty – we need a ≈53% larger sample size.

If all values turn out to be as bad as assumed above:

```
b <- data.frame(design = design, method = "ABE",
                metric = c("AUC", "Cmax", "Cmin"),
                CV = c(0.20, 0.25, 0.50),
                theta0 = c(0.90, 1.12, 0.88),
                L = 0.8, U = 1.25, n = NA_integer_,
                power = NA_real_, stringsAsFactors = FALSE)
for (i in 1:nrow(b)) {
  b[i, 8:9] <- sampleN.TOST(CV = b$CV[i], theta0 = b$theta0[i],
                            design = design, print = FALSE,
                            details = FALSE)[7:8]
  if (b$n[i] < 12) {# minimum acc. to the GL
    b$n[i]     <- 12
    b$power[i] <- power.TOST(CV = b$CV[i], theta0 = b$theta0[i],
                             design = design, n = b$n[i])
  }
}
b$L     <- sprintf("%.2f%%", 100 * b$L) # cosmetics
b$U     <- sprintf("%.2f%%", 100 * b$U)
b$power <- signif(b$power, 4)
names(b)[6:7] <- c("L ", "U ")
print(b, row.names = FALSE)
```

```
# design method metric CV theta0 L U n power
# 2x2x4 ABE AUC 0.20 0.90 80.00% 125.00% 18 0.8007
# 2x2x4 ABE Cmax 0.25 1.12 80.00% 125.00% 32 0.8050
# 2x2x4 ABE Cmin 0.50 0.88 80.00% 125.00% 154 0.8038
```

End of the story. Recall that this is a study in a 2-treatment, 2-sequence, 4-period full replicate design.

```
c <- data.frame(design = "2x2x4",
                approach = c("ABE", "Non-Superiority", "Non-Inferiority"),
                metric = c("AUC", "Cmax", "Cmin"),
                margin = c(NA, 1.25, 0.80), CV = c(0.20, 0.25, 0.50),
                theta0 = c(0.90, 1.12, 0.88), n = NA_integer_,
                power = NA_real_, stringsAsFactors = FALSE)
for (i in 1:nrow(c)) {
  if (c$approach[i] == "ABE") {# ABE
    c[i, 7:8] <- sampleN.TOST(CV = c$CV[i], theta0 = c$theta0[i],
                              design = c$design[i], print = FALSE,
                              details = FALSE)[7:8]
    if (c$n[i] < 12) {# minimum acc. to the GL
      c$n[i]     <- 12
      c$power[i] <- power.TOST(CV = c$CV[i], theta0 = c$theta0[i],
                               design = c$design[i], n = c$n[i])
    }
  } else { # Non-Inferiority, Non-Superiority
    c[i, 7:8] <- sampleN.noninf(alpha = 0.05, CV = c$CV[i],
                                margin = c$margin[i], theta0 = c$theta0[i],
                                design = c$design[i], details = FALSE,
                                print = FALSE)[6:7]
    if (c$n[i] < 12) {# minimum acc. to GLs
      c$n[i]     <- 12
      c$power[i] <- power.noninf(alpha = 0.05, CV = c$CV[i],
                                 margin = c$margin[i], theta0 = c$theta0[i],
                                 design = c$design[i], n = c$n[i])
    }
  }
}
c$power  <- signif(c$power, 4) # cosmetics
c$margin <- sprintf("%.2f", c$margin)
c$margin[c$margin == "NA"] <- "– "
print(c, row.names = FALSE)
```

```
# design approach metric margin CV theta0 n power
# 2x2x4 ABE AUC – 0.20 0.90 18 0.8007
# 2x2x4 Non-Superiority Cmax 1.25 0.25 1.12 32 0.8050
# 2x2x4 Non-Inferiority Cmin 0.80 0.50 0.88 154 0.8038
```

As an aside, we would also require 154 subjects to demonstrate Non-Inferiority of *C*_{min} in the bracketing approach. Perhaps it is more economical to opt for a clinical trial…

**Q**: Can we use R in a regulated environment and is `PowerTOST` validated?

**A**: See this document^{18} about the acceptability of Base `R` and its SDLC.^{19} `R` is updated every couple of months with documented changes^{20} and maintains a bug-tracking system.^{21} I recommend always using the latest release.

The authors of `PowerTOST` tried their best to provide reliable and valid results. The package’s `NEWS` documents its development, bug fixes, and the introduction of new methods. Issues are tracked at GitHub (as of today none is open). So far the package has had >113,000 downloads. Given its large user base, it is extremely unlikely that bugs would remain undetected.

However, the ultimate responsibility of validating *any* software (yes, of SAS as well…) lies in the hands of the user.^{22} ^{23} ^{24}

**Q**: I still have questions. How to proceed?

**A**: The preferred method is to register at the BEBA Forum and post your question in the respective category (please read the Forum’s Policy first).

You can contact me at [email protected]. Be warned – I will charge you for anything beyond most basic questions.

Helmut Schütz 2024

`R`, `PowerTOST`, and `arsenal` are licensed under GPL 3.0, `klippy` under MIT, `pandoc` under GPL 2.0.

1^{st} version July 24, 2022. Rendered April 8, 2024 14:03 CEST by rmarkdown via pandoc in 0.36 seconds.

^{1} Labes D, Schütz H, Lang B. *PowerTOST: Power and Sample Size for (Bio)Equivalence Studies.* Package version 1.5.6. 2024-03-18. CRAN.

^{2} Labes D, Schütz H, Lang B. *Package ‘PowerTOST’.* March 18, 2024. CRAN.

^{3} Chow S-C, Shao J, Wang H. *Sample Size Calculations in Clinical Research.* New York: Marcel Dekker; 2003. Chapter 3.

^{4} Julious SA. *Sample Sizes for Clinical Trials.* Boca Raton: CRC Press; 2010. Chapter 4.

^{5} EMA, CHMP. *Guideline on the pharmacokinetic and clinical evaluation of modified release dosage forms.* London. 20 November 2014. Online.

^{6} Zhang P. *A Simple Formula for Sample Size Calculation in Equivalence Studies.* J Biopharm Stat. 2003; 13(3): 529–38. doi:10.1081/BIP-120022772.

^{7} Certara USA, Inc. *Phoenix WinNonlin.* Princeton, NJ. 2023. Online.

^{9} Senn S. *Guernsey McPearson’s Drug Development Dictionary.* April 21, 2020. Online.

^{10} Hoenig JM, Heisey DM. *The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis.* Am Stat. 2001; 55(1): 19–24. doi:10.1198/000313001300339897. Open Access.

^{11} In short: there is no statistical method to ‘correct’ for unequal carryover. It can only be avoided by design, *i.e.*, a sufficiently long washout between periods. According to the guidelines, subjects with pre-dose concentrations > 5% of their *C*_{max} can be excluded from the comparison if stated in the protocol.

^{12} Especially important for drugs which are auto-inducers or -inhibitors and biologics.

^{13} Senn S. *Statistical Issues in Drug Development.* Chichester: John Wiley; 2^{nd} ed. 2007.

^{14} It depends on *both* the within- and between-subject variances. In general the latter is larger than the former (see above).

^{15} ‘The Guy in the Armani suit’ (© ElMaestro) is a running gag in the BEBA Forum. They are only proficient in PowerPoint, copy-pasting from one document to another, and shouting »*You are Fired!*« if a study fails.

^{16} EMA, CHMP. *Guideline on the Investigation of Bioequivalence.* London. 20 January 2010. Online.

^{17} Any power < 50% is a failure by definition.

^{18} The R Foundation for Statistical Computing. *A Guidance Document for the Use of R in Regulated Clinical Trial Environments.* Vienna. October 18, 2021. Online.

^{19} The R Foundation for Statistical Computing. *R: Software Development Life Cycle. A Description of R’s Development, Testing, Release and Maintenance Processes.* Vienna. October 18, 2021. Online.

^{22} FDA. *Statistical Software Clarifying Statement.* May 6, 2015. Download.

^{23} WHO. *Guidance for organizations performing in vivo bioequivalence studies.* Geneva. May 2016. Technical Report Series No. 996, Annex 9. Section 4. Online.

^{24} ICH. *Good Clinical Practice (GCP).* E6(R3) – Draft. 19 May 2023. Section 4.5. Online.