Examples in this article were generated with R 4.3.3 by the package PowerTOST.1 See also its online manual2 for details.
Abbreviation | Meaning |
---|---|
\(\small{\alpha}\) | Nominal level of the test, probability of Type I Error, patient’s risk |
\(\small{\beta}\) | Probability of Type II Error, producer’s risk |
(A)BE | (Average) Bioequivalence |
ABEL | Average Bioequivalence with Expanding Limits |
CI, CL | Confidence Interval, Limit |
\(\small{CV}\) | Coefficient of Variation |
\(\small{CV_\textrm{inter}}\) | Between-subject Coefficient of Variation |
\(\small{CV_\textrm{intra}}\) | Within-subject Coefficient of Variation |
\(\small{CV_\textrm{wR}}\) | Within-subject Coefficient of Variation of the Reference treatment |
\(\small{\delta}\) | Margin of clinical relevance in Non-Inferiority/Superiority and Non-Superiority |
\(\small{H_0}\) | Null hypothesis |
\(\small{H_1}\) | Alternative hypothesis (also \(\small{H_\textrm{a}}\)) |
L, U | Lower and upper limits in ABE(L) |
\(\small{\mu_\text{T},\,\mu_\text{R}}\) | True mean of the Test and Reference treatment, respectively |
\(\small{\pi}\) | Prospective power (\(\small{1-\beta}\)) |
TOST | Two One-Sided Tests |
What are the main statistical issues in planning a confirmatory experiment?
For details about inferential statistics and hypotheses see another article.
An ‘optimal’ study design is one which – taking all assumptions into account – has a reasonably high chance of demonstrating Non-Inferiority or Non-Superiority (power) whilst controlling the patient’s risk.
Contrary to Bioequivalence (BE), where a study is assessed with \(\small{\alpha=0.05}\) by the
TOST-procedure (or more
commonly by the \(\small{100\,(1-2\;\alpha)}\) Confidence
Interval inclusion approach), in Non-Inferiority/Superiority and
Non-Superiority the respective one-sided test with \(\small{\alpha=0.025}\) is employed.
Based on a ‘clinically relevant margin’ \(\small{\delta}\) we have different hypotheses.
We assume that higher responses are better.3 4 If data follow a lognormal distribution the hypotheses are \[H_0:\frac{\mu_\text{T}}{\mu_\text{R}}\leq \delta\;vs\;H_1:\frac{\mu_\text{T}}{\mu_\text{R}}>\delta\tag{1a}\]
If data follow a normal distribution the hypotheses are \[H_0:\mu_\text{T}-\mu_\text{R}\leq \delta\;vs\;H_1:\mu_\text{T}-\mu_\text{R}>\delta\tag{1b}\]
Applications:
We assume that lower responses are better. If data follow a lognormal distribution the hypotheses are \[H_0:\frac{\mu_\text{T}}{\mu_\text{R}}\geq \delta\;vs\;H_1:\frac{\mu_\text{T}}{\mu_\text{R}}<\delta\tag{2a}\]
If data follow a normal distribution the hypotheses are \[H_0:\mu_\text{T}-\mu_\text{R}\geq \delta\;vs\;H_1:\mu_\text{T}-\mu_\text{R}<\delta\tag{2b}\]
Applications:
A basic knowledge of R is required. To run the scripts at least version 1.4.9 (2019-12-19) of PowerTOST is suggested. Any version of R would likely do, though the current release of PowerTOST was only tested with R version 4.2.3 (2023-03-15) and later.
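A quick way to check the installed versions (a minimal sketch using base-R functions):

getRversion()                # version of R
packageVersion("PowerTOST")  # version of the package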
All scripts were run on a Xeon E3-1245v3 @ 3.40GHz (1/4 cores) 16GB RAM
with R 4.3.3 on Windows 7 build 7601, Service
Pack 1, Universal C Runtime 10.0.10240.16390.
Note that in the functions sampleN.noninf() and power.noninf() the assumed coefficient of variation CV has to be given as a ratio and not in percent. If the analysis is based on lognormal data by \(\small{(1\text{a})}\) or \(\small{(2\text{a})}\), the assumed theta0 and the margin \(\small{\delta}\) (margin) have to be given as ratios and not in percent. If the analysis is based on normal data by \(\small{(1\text{b})}\) or \(\small{(2\text{b})}\), theta0 and margin have to be given as values on the original (untransformed) scale. Data have to be continuous on a ratio scale, either lognormal \(\small{\left(x\in\mathbb{R}^{+}=\{x\mid 0<x<\infty\}\right)}\) or normal \(\small{\left(x\in\mathbb{R}=\{x\mid -\infty<x<+\infty\}\right)}\) distributed. Count data (e.g., events), rates (0 – 1) and percentages, as well as ordinal data (e.g., tmax) are not supported.
sampleN.noninf()
gives balanced sequences for crossover
designs (i.e., the same number of subjects is allocated to all
sequences) or equal group sizes in a parallel design. Furthermore, the estimated sample size is the total number of subjects – not the number of subjects per sequence or treatment arm, as reported by some other software packages. The sample size functions of PowerTOST use a modification of Zhang’s method6 based on the large sample approximation as the starting value of the iterations.
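If you are curious about the search, the argument details = TRUE of sampleN.noninf() shows the design characteristics and the steps of the sample size search (a minimal sketch with an assumed CV; defaults otherwise):

library(PowerTOST)
sampleN.noninf(CV = 0.25, details = TRUE)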
Most examples deal with studies where the response variables follow a
lognormal distribution, i.e., we assume a multiplicative model
(ratios instead of differences). We work with \(\small{\log_{e}}\)-transformed data in
order to allow analysis by the t-test (requiring differences).
This is the default in most functions of PowerTOST
and
hence, the argument logscale = TRUE
does not need to be
specified.
In software providing only a two-sided \(\small{100(1-2\,\alpha)}\) confidence interval for equivalence (e.g., Phoenix WinNonlin,7 PKanalix8): Use only the lower (for Non-Inferiority/Superiority) or upper (for Non-Superiority) confidence limit (see this article, Fig 3 for an example), which is one-sided \(\small{100(1-\alpha)}\).
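As a sketch with assumed values (point estimate 0.95, CV 0.25, n 36): CI.BE() of PowerTOST returns the two-sided \(\small{100\,(1-2\,\alpha)}\) interval; for Non-Inferiority only its lower limit – which is the one-sided \(\small{100\,(1-\alpha)}\) limit – is compared with the margin.

library(PowerTOST)
# two-sided 95% CI; compare only its lower limit with the margin (e.g., 0.80)
CI.BE(alpha = 0.025, pe = 0.95, CV = 0.25, n = 36)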
It may sound picky but ‘sample size calculation’ (as used in most guidelines and, alas, in some publications and textbooks) is sloppy terminology. In order to get prospective power (and hence, a sample size), we need five values:

1. the level of the test \(\small{\alpha}\) (controlling the patient’s risk),
2. the clinically relevant margin \(\small{\delta}\),
3. the desired (target) power \(\small{\pi}\),
4. the variability (\(\small{CV}\)), and
5. the assumed T/R-ratio or difference (theta0);

1 – 2 are fixed by the agency, 3 is set by the sponsor, and 4 – 5 are just (uncertain!) assumptions.
In other words, obtaining a sample size is not an exact calculation like \(\small{2\times2=4}\) but always just an estimation.
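A minimal sketch making all five values explicit (assumed numbers; each of them could also be given implicitly by the defaults):

library(PowerTOST)
sampleN.noninf(alpha       = 0.025, # 1. level of the test (patient’s risk)
               margin      = 0.80,  # 2. Non-Inferiority margin
               targetpower = 0.80,  # 3. desired (target) power
               CV          = 0.25,  # 4. assumed variability
               theta0      = 0.95)  # 5. assumed T/R-ratio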
“Power Calculation – A guess masquerading as mathematics.”
Of note, it is extremely unlikely that all assumptions will be exactly realized in a particular study. Hence, calculating retrospective (a.k.a. post hoc, a posteriori) power is not only futile but plain nonsense.10
Crossover studies are popular because the within-subject variability is generally lower than the between-subject variability. The efficiency of a crossover study compared to a parallel study is given by \(\small{\frac{\sigma_\textrm{intra}^2\;+\,\sigma_\textrm{inter}^2}{0.5\,\times\,\sigma_\textrm{intra}^2}}\). If, say, \(\small{\sigma_\textrm{intra}^2=0.5\times\sigma_\textrm{inter}^2}\), in a parallel study we need six times as many subjects as in a crossover to obtain the same power. On the other hand, in a crossover we have two measurements per subject, which makes the parallel study approximately three times more costly.
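A quick numerical check of this statement (assumed unit variances):

var.intra <- 1             # arbitrary within-subject variance
var.inter <- 2 * var.intra # i.e., sigma²intra = 0.5 × sigma²inter
(var.intra + var.inter) / (0.5 * var.intra) # 6: a parallel study needs ~6 times the subjects
(6 * 1) / (1 * 2)                           # 3: … but only ~3 times the measurements (costs)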
Note that there is no relationship between \(\small{CV_\textrm{intra}}\) and \(\small{CV_\textrm{inter}}\). Examples are drugs subject to polymorphic metabolism; for these \(\small{CV_\textrm{intra}\ll CV_\textrm{inter}}\). On the other hand, some highly variable drugs / drug products (HVD(P)s) show \(\small{CV_\textrm{intra}>CV_\textrm{inter}}\).
It is a prerequisite that there is no – unequal – carryover from one period to the next; only then is the comparison of treatments unbiased. For details see another article.11 Subjects have to be in the same physiological state12 throughout the study – guaranteed by a sufficiently long washout phase. Crossover studies can be performed not only in healthy volunteers but also in patients with a stable disease (e.g., asthma).
In patients with an unstable disease (e.g., in oncology) or if adverse effects are unacceptable in healthy volunteers, studies must be performed in a parallel design. If crossovers are not feasible (e.g., for drugs with a very long half-life), studies could be performed in a parallel design as well.
The sample size cannot be estimated directly; only power can be calculated for an already given sample size, because the power equations cannot be re-arranged to solve for the sample size. Hence, power is calculated for increasing sample sizes until the target power is reached (see the sketch below).
“Power. That which statisticians are always calculating but never have.”
library(PowerTOST) # attach it to run the examples
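To illustrate the point above that power can only be calculated for a given sample size, here is a naïve manual search with assumed values (the actual sample size functions start from Zhang’s approximation instead of a fixed low value):

target <- 0.80
n      <- 12 # start at the minimum acc. to the guidelines
repeat {
  pwr <- power.noninf(CV = 0.25, theta0 = 0.95, margin = 0.80,
                      n = n, design = "2x2")
  if (pwr >= target) break
  n <- n + 2 # keep the sequences balanced
}
c(n = n, power = pwr) # agrees with the first example below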
Throughout the examples I am considering studies in a single center – not multiple groups within it or multicenter studies. That’s another cup of tea (for ‘problems’ in BE see another article).
Most methods of PowerTOST
are based on pairwise
comparisons. It is up to you to adjust the level of the test
alpha
if you want to compare more than two treatments (e.g., two test
treatments vs a reference) in order to avoid inflation of the
family-wise
error rate due to multiplicity.
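With two simultaneous comparisons one might, e.g., apply Bonferroni’s adjustment (a sketch; assumed CV, defaults otherwise):

sampleN.noninf(CV = 0.25, alpha = 0.025 / 2) # larger sample size than with alpha = 0.025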
Say, we want to demonstrate Non-Inferiority in a 2×2×2 crossover design, assume a CV of 25%, a T/R-ratio of 0.95, a margin \(\small{\delta}\) of 0.8, and target a power of at least 0.80.
Since alpha = 0.025
, theta0 = 0.95
,
margin = 0.8
, targetpower = 0.8
,
design = "2x2"
, and logscale = TRUE
are
defaults of the function, we don’t have to give them explicitly.
sampleN.noninf(CV = 0.25)
#
# ++++++++++++ Non-inferiority test +++++++++++++
# Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover
# log-transformed data (multiplicative model)
#
# alpha = 0.025, target power = 0.8
# Non-inf. margin = 0.8
# True ratio = 0.95, CV = 0.25
#
# Sample size (total)
# n power
# 36 0.820330
If you want to perform the analysis with untransformed data, specify
logscale = FALSE
. Then the defaults are
theta0 = -0.05
and margin = -0.2
.
sampleN.noninf(CV = 0.25, logscale = FALSE)
#
# ++++++++++++ Non-inferiority test +++++++++++++
# Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover
# untransformed data (additive model)
#
# alpha = 0.025, target power = 0.8
# Non-inf. margin = -0.2
# True diff. = -0.05, CV = 0.25
#
# Sample size (total)
# n power
# 46 0.803507
Let’s return to lognormally distributed data because that’s more common. Say, you have information from a pilot study that the treatment performs markedly (i.e., 30%) better than placebo. You are cautious (good idea!) and assume a lower T/R-ratio and a higher CV than the observed 1.30 and 0.25.
sampleN.noninf(CV = 0.28, theta0 = 1.25, margin = 1)
#
# ++++++++++++ Non-inferiority test +++++++++++++
# Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover
# log-transformed data (multiplicative model)
#
# alpha = 0.025, target power = 0.8
# Non-inf. margin = 1
# True ratio = 1.25, CV = 0.28
#
# Sample size (total)
# n power
# 26 0.802234
What about a parallel design? Likely the CV will be substantially higher.14
sampleN.noninf(CV = 0.50, theta0 = 1.25, margin = 1, design = "parallel")
#
# ++++++++++++ Non-inferiority test +++++++++++++
# Sample size estimation
# -----------------------------------------------
# Study design: 2 parallel groups
# log-transformed data (multiplicative model)
#
# alpha = 0.025, target power = 0.8
# Non-inf. margin = 1
# True ratio = 1.25, CV = 0.5
#
# Sample size (total)
# n power
# 144 0.803753
I hear the ‘Guy in the Armani suit’15 shouting »C’mon, 72 subjects / arm, who shall pay for that? Hey, we have the wonder-drug! It works twice as good as snake oil!«
sampleN.noninf(CV = 0.50, theta0 = 2, margin = 1, design = "parallel")
#
# ++++++++++++ Non-inferiority test +++++++++++++
# Sample size estimation
# -----------------------------------------------
# Study design: 2 parallel groups
# log-transformed data (multiplicative model)
#
# alpha = 0.025, target power = 0.8
# Non-inf. margin = 1
# True ratio = 2, CV = 0.5
#
# Sample size (total)
# n power
# 18 0.831844
Cross your fingers that the drug really performs that well. If it is actually just 60% better than snake oil, power with this sample size will be only ≈51%. Master of disaster…
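A quick check of the ≈51% statement (assuming the drug is only 60% better, i.e., a true T/R-ratio of 1.6):

power.noninf(CV = 0.50, theta0 = 1.6, margin = 1,
             n = 18, design = "parallel") # roughly 0.5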
Possibly the ‘Guy in the Armani suit’ has read about ‘allocation ratios’ in the COVID-19 vaccination trials and asks »Why should we treat as many patients with snake oil as with our wonder-drug?«
Let’s see.
round.up <- function(n, alloc) {
  return(as.integer(alloc * (n %/% alloc + as.logical(n %% alloc))))
}
CV      <- 0.50 # Total (pooled) CV
theta0  <- 2    # Assumed T/R-ratio
margin  <- 1    # Non-Inferiority margin
target  <- 0.8  # Target (desired) power
alloc.T <- 3    # Allocation of wonder-drug (T)
alloc.R <- 1    # Allocation of snake oil (R)
# conventional 1:1
tmp   <- sampleN.noninf(CV = CV, theta0 = theta0, margin = margin,
                        design = "parallel", targetpower = target,
                        print = FALSE)
n.0   <- as.integer(tmp[["Sample size"]])
pwr.0 <- tmp[["Achieved power"]]
# 3:1 allocation (naïve)
n.1   <- setNames(c(round.up(n.0 / (alloc.T + alloc.R) * alloc.T, alloc.T),
                    round.up(n.0 / (alloc.T + alloc.R) * alloc.R, alloc.R)),
                  c("Test", "Reference"))
pwr.1 <- power.noninf(CV = CV, theta0 = theta0, margin = margin,
                      n = n.1, design = "parallel")
# 3:1 allocation (preserving power)
n.2   <- n.1
repeat {# increase the sample size if necessary
  pwr.2 <- power.noninf(CV = CV, theta0 = theta0, margin = margin,
                        n = n.2, design = "parallel")
  if (pwr.2 >= target) break
  n.2[["Test"]]      <- as.integer(n.2[["Test"]] + alloc.T)
  n.2[["Reference"]] <- as.integer(n.2[["Reference"]] + alloc.R)
}
fmt <- paste0("%", nchar(as.character(n.0)), ".0f")
cat("\n++++++++++++ Non-inferiority test +++++++++++++",
    "\n Sample size estimation",
    "\n-----------------------------------------------",
    "\nStudy design: 2 parallel groups",
    "\nlog-transformed data (multiplicative model)",
    "\n\nalpha = 0.025, target power =", target,
    "\nNon-inf. margin =", margin,
    paste0("\nTrue ratio = ", theta0, ", CV = ", CV),
    "\n\nTotal sample size =", n.0, "(1:1 allocation)",
    paste0("\n n (T) = ", sprintf(fmt, n.0/2),
           ", n (R) = ", sprintf(fmt, n.0/2),
           ": power = ", signif(pwr.0, 6)),
    "\nTotal sample size =", sum(n.1),
    "(naïve", paste0(alloc.T, ":", alloc.R, " allocation)"),
    "penalty", sprintf("%.0f%%", 100*(sum(n.1)/n.0-1)),
    paste0("\n n (T) = ", sprintf(fmt, n.1[["Test"]]),
           ", n (R) = ", sprintf(fmt, n.1[["Reference"]]),
           ": power = ", signif(pwr.1, 6)),
    "change", sprintf("%+.2f%%", 100 * (pwr.1 - pwr.0) / pwr.0),
    "\nTotal sample size =", sum(n.2),
    paste0("(", alloc.T, ":", alloc.R, " allocation)"),
    sprintf("%13s %.0f%%", "penalty", 100*(sum(n.2)/n.0-1)),
    paste0("\n n (T) = ", sprintf(fmt, n.2[["Test"]]),
           ", n (R) = ", sprintf(fmt, n.2[["Reference"]]),
           ": power = ", signif(pwr.2, 6)),
    "change", sprintf("%+.2f%%", 100 * (pwr.2 - pwr.0) / pwr.0), "\n")
#
# ++++++++++++ Non-inferiority test +++++++++++++
# Sample size estimation
# -----------------------------------------------
# Study design: 2 parallel groups
# log-transformed data (multiplicative model)
#
# alpha = 0.025, target power = 0.8
# Non-inf. margin = 1
# True ratio = 2, CV = 0.5
#
# Total sample size = 18 (1:1 allocation)
# n (T) = 9, n (R) = 9: power = 0.831844
# Total sample size = 20 (naïve 3:1 allocation) penalty 11%
# n (T) = 15, n (R) = 5: power = 0.766496 change -7.86%
# Total sample size = 24 (3:1 allocation) penalty 33%
# n (T) = 18, n (R) = 6: power = 0.844798 change +1.56%
Already in the naïve 3:1 allocation you have to round the sample size up because the 18 of the 1:1 allocation is not a multiple of 4. Nevertheless, you lose 7.86% power. In order to preserve power, you have to increase the sample size further.
However, it’s still based on a strong belief in the performance
of the wonder-drug. If it again turns out to be just 60% better than
snake oil, power with 24 subjects in the 3:1 allocation will be only
≈52%. Hardly better than tossing a coin.
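Again a quick check, now for the 3:1 allocation (assumed true T/R-ratio 1.6; unbalanced group sizes can be given as a vector):

power.noninf(CV = 0.50, theta0 = 1.6, margin = 1,
             n = c(18, 6), design = "parallel") # roughly 0.5, cf. the ≈52% above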
Compare a new MR formulation (regimen once a day) with an IR formulation (twice a day). Cmax is the surrogate metric for safety (Non-Superiority) and Cmin is the surrogate metric for efficacy (Non-Inferiority):
“[…] therapeutic studies might be waived [if …]:
- there is a well-defined therapeutic window in terms of safety and efficacy, the rate of input is known not to influence the safety and efficacy profile or the risk for tolerance development and
- bioequivalence between the reference and the test product is shown in terms of AUC(0-τ),ss and
- Cmax,ss for the new MR formulation is below or equivalent to the Cmax,ss for the approved formulation and Cmin,ss for the MR formulation is above or equivalent to the Cmin,ss of the approved formulation.
Although not explicitly stated in the guideline, AFAIK the EMA expects tests at \(\small{\alpha=0.05}\).
Margins are 1.25 for Cmax and 0.80 for
Cmin. We assume CVs of 0.15 for
AUC, 0.20 for Cmax, 0.35 for
Cmin, T/R-ratios of 0.95 for AUC and
Cmin and 1.05 for Cmax. We plan
the study in a 2-treatment, 2-sequence, 4-period full replicate design due to the high
variability of Cmin.
Which PK metric leads the sample
size in such a Bioequivalence (AUC) / Non-Superiority
(Cmax) / Non-Inferiority (Cmin)
study?
<- "2x2x4"
design <- data.frame(design = "2x2x4", metric = c("AUC", "Cmax", "Cmin"),
x margin = c(NA, 1.25, 0.80), CV = c(0.15, 0.20, 0.35),
theta0 = c(0.95, 1.05, 0.95), n = NA_integer_,
power = NA_real_, stringsAsFactors = FALSE)
for (i in 1:nrow(x)) {
if (x$metric[i] == "AUC") {# ABE
6:7] <- sampleN.TOST(design = design,
x[i, theta0 = x$theta0[i],
CV = x$CV[i],
details = FALSE,
print = FALSE)[7:8]
if (x$n[i] < 12) {# minimum acc. to GLs
$n[i] <- 12
x$power[i] <- power.TOST(design = design,
xtheta0 = x$theta0[i],
CV = x$CV[i],
n = x$n[i])
}else { # Non-Inferiority, Non-Superiority
}6:7] <- sampleN.noninf(design = design,
x[i, alpha = 0.05,
margin = x$margin[i],
theta0 = x$theta0[i],
CV = x$CV[i],
details = FALSE,
print = FALSE)[6:7]
if (x$n[i] < 12) {# minimum acc. to GLs
$n[i] <- 12
x$power[i] <- power.noninf(design = design,
xalpha = 0.05,
margin = x$margin[i],
theta0 = x$theta0[i],
CV = x$CV[i],
n = x$n[i])
}
}
}$power <- signif(x$power, 4) # cosmetics
x$margin <- sprintf("%.2f", x$margin)
x$margin[x$margin == "NA"] <- "– "
xprint(x, row.names = FALSE)
cat(paste0("Sample size lead by ", x$metric[x$n == max(x$n)], ".\n"))
# design metric margin CV theta0 n power
# 2x2x4 AUC – 0.15 0.95 12 0.9881
# 2x2x4 Cmax 1.25 0.20 1.05 12 0.9098
# 2x2x4 Cmin 0.80 0.35 0.95 26 0.8184
# Sample size lead by Cmin.
However, with 26 subjects to show Non-Inferiority of Cmin the study is ‘overpowered’ (see this article) for BE of AUC and Non-Superiority of Cmax:
cat("Power with", max(x$n), "subjects for",
"\nAUC :",
power.TOST(design = design, CV = x$CV[1],
theta0 = x$theta0[1], n = max(x$n)),
"\nCmax:",
power.noninf(design = design, alpha = 0.05, margin = 1.25,
CV = x$CV[2], theta0 = x$theta0[2], n = max(x$n)), "\n")
# Power with 26 subjects for
# AUC : 0.9999851
# Cmax: 0.9974663
That gives us some room to navigate, e.g., for Cmax if values turn out to be ‘worse’ (say, CV 0.20 → 0.25, T/R-ratio 1.05 → 1.10):
power.noninf(design = design, alpha = 0.05, margin = 1.25,
             CV = 0.25, theta0 = 1.10, n = max(x$n)) # higher CV, worse theta0
# [1] 0.8359967
The bracketing approach may require a lower sample size than required for demonstrating BE with the common CI-inclusion approach for all PK metrics, which is another option mentioned in the guideline.3 Note that reference-scaling by ABEL is acceptable for Cmax5 16 and Cmin5 if their CVwR >30%, expanding the limits can be justified based on clinical grounds, and CVwR > 30% is not caused by ‘outliers’. How does that compare?
y <- data.frame(design = design, method = "ABE",
                metric = c("AUC", "Cmax", "Cmin"),
                CV = c(0.15, 0.20, 0.35),
                theta0 = c(0.95, 1.05, 0.90),
                L = 0.8, U = 1.25, n = NA_integer_,
                power = NA_real_, stringsAsFactors = FALSE)
for (i in 1:nrow(y)) {
  if (y$metric[i] == "AUC" | y$CV[i] <= 0.3) {
    y[i, 8:9] <- sampleN.TOST(CV = y$CV[i], theta0 = y$theta0[i],
                              design = design, print = FALSE,
                              details = FALSE)[7:8]
    if (y$n[i] < 12) {# minimum acc. to the GL
      y$n[i]     <- 12
      y$power[i] <- power.TOST(CV = y$CV[i], theta0 = y$theta0[i],
                               design = design, n = y$n[i])
    }
  } else {
    y$method[i] <- "ABEL"
    y[i, 6:7]   <- scABEL(CV = y$CV[i])
    y[i, 8:9]   <- sampleN.scABEL(CV = y$CV[i], theta0 = y$theta0[i],
                                  design = design, print = FALSE,
                                  details = FALSE)[8:9]
  }
}
y$L     <- sprintf("%.2f%%", 100 * y$L) # cosmetics
y$U     <- sprintf("%.2f%%", 100 * y$U)
y$power <- signif(y$power, 4)
names(y)[6:7] <- c("L ", "U ")
print(y, row.names = FALSE)
# design method metric CV theta0 L U n power
# 2x2x4 ABE AUC 0.15 0.95 80.00% 125.00% 12 0.9881
# 2x2x4 ABE Cmax 0.20 1.05 80.00% 125.00% 12 0.9085
# 2x2x4 ABEL Cmin 0.35 0.90 77.23% 129.48% 34 0.8118
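The expanded limits in the last row can be obtained directly with scABEL() of PowerTOST:

scABEL(CV = 0.35) # ≈ 0.7723 … 1.2948, as in the table above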
Which approach is optimal is a case-by-case decision. Although in this example bracketing is the ‘winner’ (26 subjects instead of 34), it might be problematic if a CV turns out larger and/or a T/R-ratio worse than assumed: CV of AUC 0.15 → 0.20, Cmax 0.20 → 0.25, Cmin 0.35 → 0.50; T/R-ratio of AUC 0.95 → 0.90, Cmax 1.05 → 1.12, Cmin 0.90 → 0.88.
n <- max(y$n)
z <- data.frame(approach = c("ABE", "Non-Superiority", "ABE",
                             "Non-Inferiority", "ABE"),
                metric = c("AUC", rep(c("Cmax", "Cmin"), each = 2)),
                CV = c(0.2, rep(c(0.25, 0.50), each = 2)),
                theta0 = c(0.90, rep(c(1.12, 0.88), each = 2)),
                margin = c(NA, 1.25, NA, 0.80, NA),
                L = c(0.80, NA, 0.80, NA, 0.80),
                U = c(1.25, NA, 1.25, NA, 1.25),
                n = n, power = NA_real_,
                stringsAsFactors = FALSE)
for (i in 1:nrow(z)) {
  if (z$approach[i] %in% c("Non-Superiority", "Non-Inferiority")) {
    z$power[i] <- power.noninf(design = design,
                               alpha = 0.05,
                               margin = z$margin[i],
                               theta0 = z$theta0[i],
                               CV = z$CV[i],
                               n = z$n[i])
  } else {
    if (z$CV[i] <= 0.3) {
      z$power[i] <- power.TOST(design = design,
                               theta0 = z$theta0[i],
                               CV = z$CV[i],
                               n = z$n[i])
    } else {
      z$approach[i] <- "ABEL"
      z[i, 6:7]     <- scABEL(CV = z$CV[i])
      z$power[i]    <- power.scABEL(design = design,
                                    theta0 = z$theta0[i],
                                    CV = z$CV[i],
                                    n = z$n[i])
    }
  }
}
z$L      <- sprintf("%.2f%%", 100 * z$L) # cosmetics
z$U      <- sprintf("%.2f%%", 100 * z$U)
z$power  <- signif(z$power, 4)
z$margin <- sprintf("%.2f", z$margin)
z$margin[z$margin == "NA"] <- "– "
z$L[z$L == "NA%"] <- "– "
z$U[z$U == "NA%"] <- "– "
names(z)[6:7] <- c("L ", "U ")
print(z, row.names = FALSE)
# approach metric CV theta0 margin L U n power
# ABE AUC 0.20 0.90 – 80.00% 125.00% 34 0.9640
# Non-Superiority Cmax 0.25 1.12 1.25 – – 34 0.8258
# ABE Cmax 0.25 1.12 – 80.00% 125.00% 34 0.8258
# Non-Inferiority Cmin 0.50 0.88 0.80 – – 34 0.3169
# ABEL Cmin 0.50 0.88 – 69.84% 143.19% 34 0.8183
Hence, in this case the equivalence approach by ABE(L) is the ‘winner’ because it tolerates larger deviations from the assumptions.
What happens if you fail to convince the agency that ABEL is acceptable? The picture changes.
a <- data.frame(design = design, method = "ABE",
                metric = c("AUC", "Cmax", "Cmin"),
                CV = c(0.15, 0.20, 0.35),
                theta0 = c(0.95, 1.05, 0.90),
                L = 0.8, U = 1.25, n = NA_integer_,
                power = NA_real_, stringsAsFactors = FALSE)
for (i in 1:nrow(a)) {
  a[i, 8:9] <- sampleN.TOST(CV = a$CV[i], theta0 = a$theta0[i],
                            design = design, print = FALSE,
                            details = FALSE)[7:8]
  if (a$n[i] < 12) {# minimum acc. to the GL
    a$n[i]     <- 12
    a$power[i] <- power.TOST(CV = a$CV[i], theta0 = a$theta0[i],
                             design = design, n = a$n[i])
  }
}
a$L     <- sprintf("%.2f%%", 100 * a$L) # cosmetics
a$U     <- sprintf("%.2f%%", 100 * a$U)
a$power <- signif(a$power, 4)
names(a)[6:7] <- c("L ", "U ")
print(a, row.names = FALSE)
# design method metric CV theta0 L U n power
# 2x2x4 ABE AUC 0.15 0.95 80.00% 125.00% 12 0.9881
# 2x2x4 ABE Cmax 0.20 1.05 80.00% 125.00% 12 0.9085
# 2x2x4 ABE Cmin 0.35 0.90 80.00% 125.00% 52 0.8003
Nasty – we need a ≈53% larger sample size.
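The ≈53% is simply the relative increase over the 34 subjects needed with ABEL:

52 / 34 - 1 # ≈ 0.53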
If all values turn out to be as poor as above:
b <- data.frame(design = design, method = "ABE",
                metric = c("AUC", "Cmax", "Cmin"),
                CV = c(0.20, 0.25, 0.50),
                theta0 = c(0.90, 1.12, 0.88),
                L = 0.8, U = 1.25, n = NA_integer_,
                power = NA_real_, stringsAsFactors = FALSE)
for (i in 1:nrow(b)) {
  b[i, 8:9] <- sampleN.TOST(CV = b$CV[i], theta0 = b$theta0[i],
                            design = design, print = FALSE,
                            details = FALSE)[7:8]
  if (b$n[i] < 12) {# minimum acc. to the GL
    b$n[i]     <- 12
    b$power[i] <- power.TOST(CV = b$CV[i], theta0 = b$theta0[i],
                             design = design, n = b$n[i])
  }
}
b$L     <- sprintf("%.2f%%", 100 * b$L) # cosmetics
b$U     <- sprintf("%.2f%%", 100 * b$U)
b$power <- signif(b$power, 4)
names(b)[6:7] <- c("L ", "U ")
print(b, row.names = FALSE)
# design method metric CV theta0 L U n power
# 2x2x4 ABE AUC 0.20 0.90 80.00% 125.00% 18 0.8007
# 2x2x4 ABE Cmax 0.25 1.12 80.00% 125.00% 32 0.8050
# 2x2x4 ABE Cmin 0.50 0.88 80.00% 125.00% 154 0.8038
End of the story. Recall that this is a study in a 2-treatment, 2-sequence, 4-period full replicate design.
<- data.frame(design = "2x2x4",
c approach = c("ABE", "Non-Superiority", "Non-Inferiority"),
metric = c("AUC", "Cmax", "Cmin"),
margin = c(NA, 1.25, 0.80), CV = c(0.20, 0.25, 0.50),
theta0 = c(0.90, 1.12, 0.88), n = NA_integer_,
power = NA_real_, stringsAsFactors = FALSE)
for (i in 1:nrow(c)) {
if (c$approach[i] == "ABE") {# ABE
7:8] <- sampleN.TOST(CV = c$CV[i], theta0 = c$theta0[i],
c[i, design = c$design[i], print = FALSE,
details = FALSE)[7:8]
if (c$n[i] < 12) {# minimum acc. to the GL
$n[i] <- 12
c$power[i] <- power.TOST(CV = c$CV[i], theta0 = c$theta0[i],
cdesign = c$design[i], n = c$n[i])
}else { # Non-Inferiority, Non-Superiority
}7:8] <- sampleN.noninf(alpha = 0.05, CV = c$CV[i],
c[i, margin = c$margin[i], theta0 = c$theta0[i],
design = c$design[i], details = FALSE,
print = FALSE)[6:7]
if (c$n[i] < 12) {# minimum acc. to GLs
$n[i] <- 12
c$power[i] <- power.noninf(alpha = 0.05, CV = c$CV[i],
cmargin = c$margin[i], theta0 = c$theta0[i],
design = c$design[i], n = c$n[i])
}
}
}$power <- signif(c$power, 4) # cosmetics
c$margin <- sprintf("%.2f", c$margin)
c$margin[c$margin == "NA"] <- "– "
cprint(c, row.names = FALSE)
# design approach metric margin CV theta0 n power
# 2x2x4 ABE AUC – 0.20 0.90 18 0.8007
# 2x2x4 Non-Superiority Cmax 1.25 0.25 1.12 32 0.8050
# 2x2x4 Non-Inferiority Cmin 0.80 0.50 0.88 154 0.8038
As an aside, we would also require 154 subjects to demonstrate Non-Inferiority of Cmin in the bracketing approach. Perhaps it would be more economical to opt for a clinical trial…
Is PowerTOST validated? Start with R itself and its SDLC.19 R is updated every couple of months with documented changes20 and maintains a bug-tracking system.21 I recommend to always use the latest release. The authors of PowerTOST tried to do their best to provide reliable and valid results. The package’s NEWS documents its development, bug fixes, and the introduction of new methods. Issues are tracked at GitHub (as of today none is still open). So far the package had >113,000 downloads. Therefore, it is extremely unlikely that bugs were not detected, given its large user base.
Helmut Schütz 2024. Licenses: R, PowerTOST, and arsenal GPL 3.0; klippy MIT; pandoc GPL 2.0. 1st version July 24, 2022. Rendered April 8, 2024 14:03 CEST by rmarkdown via pandoc in 0.36 seconds.
Labes D, Schütz H, Lang B. PowerTOST: Power and Sample Size for (Bio)Equivalence Studies. Package version 1.5.6. 2024-03-18. CRAN.↩︎
Labes D, Schütz H, Lang B. Package ‘PowerTOST’. March 18, 2024. CRAN.↩︎
Chow S-C, Shao J, Wang H. Sample Size Calculations in Clinical Research. New York: Marcel Dekker; 2003. Chapter 3.↩︎
Julious SA. Sample Sizes for Clinical Trials. Boca Raton: CRC Press; 2010. Chapter 4.↩︎
EMA, CHMP. Guideline on the pharmacokinetic and clinical evaluation of modified release dosage forms. London. 20 November 2014. Online.↩︎
Zhang P. A Simple Formula for Sample Size Calculation in Equivalence Studies. J Biopharm Stat. 2003; 13(3): 529–38. doi:10.1081/BIP-120022772.↩︎
Certara USA, Inc. Princeton, NJ. 2023. Phoenix WinNonlin. Online.↩︎
Senn S. Guernsey McPearson’s Drug Development Dictionary. April 21, 2020. Online.↩︎
Hoenig JM, Heisey DM. The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. Am Stat. 2001; 55(1): 19–24. doi:10.1198/000313001300339897. Open Access.↩︎
In short: There is no statistical method to ‘correct’ for unequal carryover. It can only be avoided by design, i.e., a sufficiently long washout between periods. According to the guidelines subjects with pre-dose concentrations > 5% of their Cmax can be excluded from the comparison if stated in the protocol.↩︎
Especially important for drugs which are auto-inducers or -inhibitors and biologics.↩︎
Senn S. Statistical Issues in Drug Development. Chichester: John Wiley; 2nd ed 2007.↩︎
It depends on both the within- and between-subject variances. In general the latter is larger than the former (see above).↩︎
‘The Guy in the Armani suit’ (© ElMaestro, introduced there) is a running gag in the BEBA Forum. They are only proficient in PowerPoint, copy-pasting from one document to another, and shouting »You are Fired!« if a study fails.↩︎
EMA, CHMP. Guideline on the Investigation of Bioequivalence. London. 20 January 2010. Online.↩︎
Any power < 50% is a failure by definition.↩︎
The R Foundation for Statistical Computing. A Guidance Document for the Use of R in Regulated Clinical Trial Environments. Vienna. October 18, 2021. Online.↩︎
The R Foundation for Statistical Computing. R: Software Development Life Cycle. A Description of R’s Development, Testing, Release and Maintenance Processes. Vienna. October 18, 2021. Online.↩︎
FDA. Statistical Software Clarifying Statement. May 6, 2015. Download.↩︎
WHO. Guidance for organizations performing in vivo bioequivalence studies. Geneva. May 2016. Technical Report Series No. 996, Annex 9. Section 4. Online.↩︎
ICH. Good Clinical Practice (GCP). E6(R3) – Draft. 19 May 2023. Section 4.5. Online.↩︎