
Examples in this article were generated with R 4.2.1 by the package PowerTOST.1 See also its Online manual2 for details.

| Abbreviation | Meaning |
|---|---|
| $$\small{\alpha}$$ | Nominal level of the test, probability of Type I Error, patient’s risk |
| $$\small{\beta}$$ | Probability of Type II Error, producer’s risk |
| (A)BE | (Average) Bioequivalence |
| ABEL | Average Bioequivalence with Expanding Limits |
| CI, CL | Confidence Interval, Limit |
| $$\small{CV}$$ | Coefficient of Variation |
| $$\small{CV_\textrm{inter}}$$ | Between-subject Coefficient of Variation |
| $$\small{CV_\textrm{intra}}$$ | Within-subject Coefficient of Variation |
| $$\small{CV_\textrm{wR}}$$ | Within-subject Coefficient of Variation of the Reference treatment |
| $$\small{\delta}$$ | Margin of clinical relevance in Non-Inferiority and Non-Superiority |
| $$\small{H_0}$$ | Null hypothesis |
| $$\small{H_1}$$ | Alternative hypothesis (also $$\small{H_\textrm{a}}$$) |
| L, U | Lower and upper limits in ABE(L) |
| $$\small{\mu_\text{T},\,\mu_\text{R}}$$ | True mean of the Test and Reference treatment, respectively |
| $$\small{\pi}$$ | Prospective power ($$\small{1-\beta}$$) |
| TOST | Two One-Sided Tests |

# Introduction

What are the main statistical issues in planning a confirmatory experiment?

For details about inferential statistics and hypotheses see another article.

An ‘optimal’ study design is one, which – taking all assumptions into account – has a reasonably high chance of demonstrating non-inferiority or non-superiority (power) whilst controlling the patient’s risk.

Contrary to Bioequivalence (BE), where a study is assessed with $$\small{\alpha=0.05}$$ by TOST (or by a $$\small{100\,(1-2\times0.05)\%}$$ Confidence Interval), in Non-Inferiority and Non-Superiority a single one-sided test with $$\small{\alpha=0.025}$$ is employed.
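The relation between the level of a one-sided test and the coverage of the matching two-sided confidence interval can be spelled out in a few lines of base R. This is only an illustrative sketch; the helper name `ci.level` is hypothetical and not part of PowerTOST.

```r
# Hypothetical helper (not a PowerTOST function): coverage in percent of the
# two-sided CI whose limits correspond to one-sided tests at level alpha
ci.level <- function(alpha) 100 * (1 - 2 * alpha)

ci.level(0.05)  # TOST in BE: both limits of the 90% CI are assessed
ci.level(0.025) # Non-Inferiority / Non-Superiority: one limit of the 95% CI
```

In BE both limits of the interval matter, whereas in Non-Inferiority or Non-Superiority only the limit facing the margin is assessed.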

Based on a ‘clinically relevant margin’ $$\small{\delta}$$ we have different hypotheses.

## Non-Inferiority

We assume that higher responses are better. If data follow a lognormal distribution the hypotheses are
$H_0:\log_{e}\frac{\mu_\text{T}}{\mu_\text{R}}\leq \log_{e}\delta\;\textrm{vs}\;H_1:\log_{e}\frac{\mu_\text{T}}{\mu_\text{R}}>\log_{e}\delta\tag{1a}$
Fig. 1: $$\small{\delta=0.8}$$ (x-axis in log-scale).

If data follow a normal distribution the hypotheses are
$H_0:\mu_\text{T}-\mu_\text{R}\leq \delta\;\textrm{vs}\;H_1:\mu_\text{T}-\mu_\text{R}>\delta\tag{1b}$
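To illustrate how the lognormal hypotheses translate into a decision, here is a minimal sketch in base R. It uses simulated within-subject log-differences (not real study data) and a simple paired analysis that ignores period and sequence effects; the sample size, seed, and variability are arbitrary assumptions.

```r
set.seed(123456)          # arbitrary seed; the data below are simulated
n      <- 24              # hypothetical number of subjects
alpha  <- 0.025           # level of the one-sided test
margin <- 0.80            # Non-Inferiority margin (delta)
# simulated differences log(T) - log(R) of each subject
d      <- rnorm(n, mean = log(0.95), sd = 0.25)
# one-sided t-test of H0: mean difference <= log(margin)
res    <- t.test(d, mu = log(margin), alternative = "greater",
                 conf.level = 1 - alpha)
lower  <- exp(res$conf.int[1]) # one-sided lower CL of the T/R-ratio
cat("lower CL:", signif(lower, 4), " p:", signif(res$p.value, 4),
    "\nNon-Inferiority shown:", lower > margin, "\n")
```

Rejecting the null hypothesis (p < alpha) and finding the lower confidence limit above the margin are the same decision, which is why either can be reported.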

Applications:

• Clinical phase III trials comparing a new treatment with placebo or an established treatment (efficacy).
• Comparing minimum concentrations of a new MR formulation with the ones of an approved IR formulation as a surrogate of efficacy.3

## Non-Superiority

We assume that lower responses are better. If data follow a lognormal distribution the hypotheses are
$H_0:\log_{e}\frac{\mu_\text{T}}{\mu_\text{R}}\geq \log_{e}\delta\;\textrm{vs}\;H_1:\log_{e}\frac{\mu_\text{T}}{\mu_\text{R}}<\log_{e}\delta\tag{2a}$
Fig. 2: $$\small{\delta=1.25}$$ (x-axis in log-scale).

If data follow a normal distribution the hypotheses are
$H_0:\mu_\text{T}-\mu_\text{R}\geq \delta\;\textrm{vs}\;H_1:\mu_\text{T}-\mu_\text{R}<\delta\tag{2b}$
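On the log-scale Non-Superiority is the mirror image of Non-Inferiority. The sketch below checks this numerically with a hand-written power function for a 2×2×2 crossover based on the noncentral t-distribution. The helper `pwr.1sided` is hypothetical, and the rule that a margin above 1 selects the Non-Superiority direction is an assumption of this sketch; CV, ratios, and n are arbitrary illustration values.

```r
# Sketch only: power of a one-sided t-test in a 2x2x2 crossover (log-scale).
# pwr.1sided() is a hypothetical helper, not a PowerTOST function.
pwr.1sided <- function(CV, theta0, margin, n, alpha = 0.025) {
  se  <- sqrt(log(CV^2 + 1)) * sqrt(2 / n) # SE of the log-difference
  df  <- n - 2
  dir <- if (margin > 1) -1 else +1        # margin > 1: lower is better
  ncp <- dir * (log(theta0) - log(margin)) / se
  pt(qt(1 - alpha, df), df, ncp = ncp, lower.tail = FALSE)
}
ns <- pwr.1sided(CV = 0.20, theta0 = 1.05,     margin = 1.25, n = 24)
ni <- pwr.1sided(CV = 0.20, theta0 = 1 / 1.05, margin = 0.80, n = 24)
c(NonSuperiority = ns, NonInferiority = ni) # equal by symmetry
```

A T/R-ratio of 1.05 against a margin of 1.25 is as far from its margin (in log-units) as 1/1.05 is from 0.80, hence the two powers coincide.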

Applications:

• Clinical phase III trials comparing AEs of a new treatment with placebo or an established treatment (safety).
• Comparing maximum concentrations of a new MR formulation with the ones of an approved IR formulation as a surrogate of safety.3


## Preliminaries

A basic knowledge of R is required. To run the scripts at least version 1.4.9 (2019-12-19) of PowerTOST is suggested. Any version of R would likely do, though the current release of PowerTOST was only tested with version 4.1.3 (2022-03-10) and later.
All scripts were run on a Xeon E3-1245v3 @ 3.40GHz (1/4 cores) 16GB RAM with R 4.2.1 on Windows 7 build 7601, Service Pack 1, Universal C Runtime 10.0.10240.16390.

Note that in the functions sampleN.noninf() and power.noninf() the assumed coefficient of variation CV has to be given as a ratio and not in percent. If the analysis is based on lognormal data by $$\small{(1\text{a})}$$ or $$\small{(2\text{a})}$$, the assumed theta0 and the margin $$\small{\delta}$$ (margin) have to be given as ratios and not in percent. If the analysis is based on normal data by $$\small{(1\text{b})}$$ or $$\small{(2\text{b})}$$, theta0 and margin have to be given in the original units. Data have to be continuous on a ratio scale, either lognormal $$\small{\left(x\in\mathbb{R}^{+}=\{0<x<\infty\}\right)}$$ or normal $$\small{\left(x\in\mathbb{R}=\{-\infty<x<+\infty\}\right)}$$ distributed.
Count data (e.g., events), rates (0 – 1) and percentages, as well as ordinal data (e.g., tmax) are not supported.
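Since the CV is entered as a ratio, it may help to see how it maps to the standard deviation of the log-transformed data under the lognormal model. The two helpers below are hypothetical names written for this sketch in base R.

```r
# Hypothetical helpers: CV (as a ratio) <-> standard deviation on the log-scale
CV2sd <- function(CV) sqrt(log(CV^2 + 1))
sd2CV <- function(s)  sqrt(exp(s^2) - 1)

CV <- c(0.25, 0.28, 0.50) # ratios as used in the examples, not percent
data.frame(CV = CV, CV.pct = 100 * CV, sd.log = signif(CV2sd(CV), 4))
```

Entering 25 instead of 0.25 would therefore imply an absurdly large variance, not a 25% CV.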

sampleN.noninf() gives balanced sequences for crossover designs (i.e., the same number of subjects is allocated to all sequences) or equal group sizes in a parallel design. Furthermore, the estimated sample size is the total number of subjects, not subjects per sequence or treatment arm – like in some other software packages. The sample size functions of PowerTOST use a modification of Zhang’s method4 based on the large sample approximation as the starting value of the iterations.
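To give an idea of what a large sample approximation looks like, here is a sketch of the plain normal approximation for a 2×2×2 crossover with log-transformed data. It is neither Zhang's modification nor PowerTOST's code; the function name `n.approx` and its defaults are assumptions chosen for illustration.

```r
# Sketch: normal approximation of the total sample size (2x2x2 crossover,
# log-scale). n.approx() is a hypothetical helper, not a PowerTOST function.
n.approx <- function(CV, theta0 = 0.95, margin = 0.8, alpha = 0.025,
                     targetpower = 0.8) {
  s2 <- log(CV^2 + 1) # within-subject variance on the log-scale
  n  <- 2 * s2 * (qnorm(1 - alpha) + qnorm(targetpower))^2 /
        (log(theta0) - log(margin))^2
  2 * ceiling(n / 2)  # round up to the next even number (balanced sequences)
}
n.approx(CV = 0.25)   # gives 34
```

Such an approximation replaces the t-distribution by the normal distribution and can therefore end up below the exact result, which is why it serves only as a starting value.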

Most examples deal with studies where the response variables follow a lognormal distribution, i.e., we assume a multiplicative model (ratios instead of differences). We work with $$\small{\log_{e}}$$-transformed data in order to allow analysis by the t-test (requiring differences). This is the default in most functions of PowerTOST and hence, the argument logscale = TRUE does not need to be specified.


## Terminology

It may sound picky but ‘sample size calculation’ (as used in most guidelines and alas, in some publications and textbooks) is sloppy terminology. In order to get prospective power (and hence, a sample size), we need five values:

1. The level of the test $$\small{\alpha}$$ (in Non-Superiority / Non-Inferiority commonly 0.025),
2. the clinically relevant margin $$\small{\delta}$$,
3. the desired (or target) power $$\small{\pi}$$,
4. the variance (commonly expressed as a coefficient of variation), and
5. the deviation of the test from the reference treatment.

1 – 2 are fixed by the agency,
3 is set by the sponsor, and
4 – 5 are just (uncertain!) assumptions.

In other words, obtaining a sample size is not an exact calculation like $$\small{2\times2=4}$$ but always just an estimation.

Power Calculation – A guess masquerading as mathematics.
Stephen Senn (2020)5
Realization: Observations (in a sample) of a random variable (of the population).

Of note, it is extremely unlikely that all assumptions will be exactly realized in a particular study. Hence, calculating retrospective (a.k.a. post hoc, a posteriori) power is not only futile but plain nonsense.6

Crossover studies are popular because the within-subject variability is generally lower than the between-subject variability. The efficiency of a crossover study compared to a parallel study is given by $$\small{\frac{\sigma_\textrm{intra}^2\;+\,\sigma_\textrm{inter}^2}{0.5\,\times\,\sigma_\textrm{intra}^2}}$$. If, say, $$\small{\sigma_\textrm{intra}^2=0.5\times\sigma_\textrm{inter}^2}$$, in a parallel study we need six times as many subjects as in a crossover to obtain the same power. On the other hand, in a crossover we have two measurements per subject, which makes the parallel study approximately three times more costly.
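The arithmetic behind the factors six and three can be retraced in base R. The variance of 1 is an arbitrary unit chosen for this sketch; only the ratio of the variances matters.

```r
var.inter <- 1               # arbitrary unit (assumption of this sketch)
var.intra <- 0.5 * var.inter # within-subject variance is half of it
eff <- (var.intra + var.inter) / (0.5 * var.intra)
eff     # subjects needed in a parallel design relative to a crossover
eff / 2 # relative number of measurements (two per subject in a crossover)
```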

Note that there is no relationship between $$\small{CV_\textrm{intra}}$$ and $$\small{CV_\textrm{inter}}$$. Examples are drugs which are subject to polymorphic metabolism. For these drugs $$\small{CV_\textrm{intra}\ll CV_\textrm{inter}}$$. On the other hand, some highly variable drugs / drug products (HVD(P)s) show $$\small{CV_\textrm{intra}>CV_\textrm{inter}}$$.

Carryover: A residual effect of a previous period.

It is a prerequisite that no carryover from one period to the next exists. Only then will the comparison of treatments be unbiased. For details see another article.7 Subjects have to be in the same physiological state8 throughout the study – guaranteed by a sufficiently long washout phase. Crossover studies can be performed not only in healthy volunteers but also in patients with a stable disease (e.g., asthma). Studies in patients with an unstable disease (e.g., in oncology) must be performed in a parallel design.
If crossovers are not feasible (e.g., for drugs with a very long half life), studies could be performed in a parallel design as well.

# Power → Sample size

The sample size cannot be directly estimated,
only power calculated for an already given sample size.

The power equations cannot be re-arranged to solve for sample size.
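Because power can only be calculated for a given sample size, the sample size has to be found by a search. The sketch below does this in base R for a 2×2×2 crossover with log-transformed data, using the noncentral t-distribution. It is a simplified illustration, not PowerTOST's implementation; the helper `pwr.ni`, its defaults, and the brute-force start at n = 4 are assumptions.

```r
# Sketch: power of the one-sided t-test for Non-Inferiority (2x2x2 crossover,
# log-scale). pwr.ni() is a hypothetical helper, not a PowerTOST function.
pwr.ni <- function(n, CV, theta0 = 0.95, margin = 0.8, alpha = 0.025) {
  se  <- sqrt(log(CV^2 + 1)) * sqrt(2 / n) # SE of the log-difference
  df  <- n - 2
  ncp <- (log(theta0) - log(margin)) / se  # noncentrality parameter
  pt(qt(1 - alpha, df), df, ncp = ncp, lower.tail = FALSE)
}
n <- 4                          # smallest balanced sample size to try
while (pwr.ni(n, CV = 0.25) < 0.80) {
  n <- n + 2                    # keep sequences balanced
}
c(n = n, power = pwr.ni(n, CV = 0.25))
```

For a CV of 0.25 the loop stops at 36 subjects. Starting from an approximation instead of n = 4 only saves iterations; the result of the search is the same.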

Power. That which statisticians are always calculating but never have.
Stephen Senn (2007)9

# Examples

library(PowerTOST) # attach it to run the examples

Note that in Non-Inferiority and Non-Superiority – contrary to other functions of the package – a one-sided t-test (instead of TOST aiming at equivalence) is employed.
Throughout the examples I’m referring to studies in a single center – not multiple groups within them or multicenter studies. That’s another pot of tea.

Most methods of PowerTOST are based on pairwise comparisons. It is up to you to adjust the level of the test alpha if you want to compare more (say, two test treatments vs a reference or each of them against one of the others) in order to avoid inflation of the family-wise error rate due to multiplicity.
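As an illustration of such an adjustment, the sketch below applies a Bonferroni correction for two comparisons and shows its effect on the sample size of a 2×2×2 crossover. Bonferroni is only one possible choice, the number of comparisons is an assumption, and the helpers `pwr` and `n.min` are hypothetical base R stand-ins rather than PowerTOST functions.

```r
# Hypothetical helper: power of the one-sided t-test, 2x2x2 crossover, log-scale
pwr <- function(n, alpha, CV = 0.25, theta0 = 0.95, margin = 0.8) {
  se <- sqrt(log(CV^2 + 1)) * sqrt(2 / n)
  pt(qt(1 - alpha, n - 2), n - 2,
     ncp = (log(theta0) - log(margin)) / se, lower.tail = FALSE)
}
# Hypothetical helper: smallest even n reaching the target power
n.min <- function(alpha, target = 0.8) {
  n <- 4
  while (pwr(n, alpha) < target) n <- n + 2
  n
}
k         <- 2         # assumed number of comparisons
alpha     <- 0.025     # level of a single comparison
alpha.adj <- alpha / k # Bonferroni-adjusted level
c(unadjusted = n.min(alpha), adjusted = n.min(alpha.adj))
```

The adjusted level is more demanding, so the sample size increases.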

Say, we want to demonstrate Non-Inferiority in a 2×2×2 crossover design, assuming a CV of 25%, a T/R-ratio of 0.95, a margin $$\small{\delta}$$ of 0.8, and targeting a power of at least 0.80.
Since alpha = 0.025, theta0 = 0.95, margin = 0.8, targetpower = 0.8, design = "2x2", and logscale = TRUE are defaults of the function we don’t have to give them explicitly.

sampleN.noninf(CV = 0.25)
#
# ++++++++++++ Non-inferiority test +++++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover
# log-transformed data (multiplicative model)
#
# alpha = 0.025, target power = 0.8
# Non-inf. margin = 0.8
# True ratio = 0.95,  CV = 0.25
#
# Sample size (total)
#  n     power
# 36   0.820330

If you want to perform the analysis with untransformed data, specify logscale = FALSE. Then the defaults are theta0 = -0.05 and margin = -0.2.

sampleN.noninf(CV = 0.25, logscale = FALSE)
#
# ++++++++++++ Non-inferiority test +++++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover
#
# alpha = 0.025, target power = 0.8
# Non-inf. margin = -0.2
# True diff. = -0.05,  CV = 0.25
#
# Sample size (total)
#  n     power
# 46   0.803507

Say, you have information from a pilot study that the treatment performs really (i.e., 30%) better than placebo. You are cautious (good idea!), and assume a lower T/R-ratio and a higher CV than the observed 1.30 and 0.25.

sampleN.noninf(CV = 0.28, theta0 = 1.25, margin = 1)
#
# ++++++++++++ Non-inferiority test +++++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover
# log-transformed data (multiplicative model)
#
# alpha = 0.025, target power = 0.8
# Non-inf. margin = 1
# True ratio = 1.25,  CV = 0.28
#
# Sample size (total)
#  n     power
# 26   0.802234

What about a parallel design? Likely the CV will be substantially higher.10

sampleN.noninf(CV = 0.50, theta0 = 1.25, margin = 1, design = "parallel")
#
# ++++++++++++ Non-inferiority test +++++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 2 parallel groups
# log-transformed data (multiplicative model)
#
# alpha = 0.025, target power = 0.8
# Non-inf. margin = 1
# True ratio = 1.25,  CV = 0.5
#
# Sample size (total)
#  n     power
# 144   0.803753

I hear the ‘Guy in the Armani suit’11 shouting »C’mon, 72 subjects / arm, who shall pay for that? Hey, we have the wonder-drug! It works twice as good as snake oil!«

sampleN.noninf(CV = 0.50, theta0 = 2, margin = 1, design = "parallel")
#
# ++++++++++++ Non-inferiority test +++++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 2 parallel groups
# log-transformed data (multiplicative model)
#
# alpha = 0.025, target power = 0.8
# Non-inf. margin = 1
# True ratio = 2,  CV = 0.5
#
# Sample size (total)
#  n     power
# 18   0.831844

Cross fingers that the drug performs really that great. If it is actually just 60% better than snake oil, power with this sample will be only ≈51%. Master of disaster…
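How quickly power erodes when the drug is less effective than hoped can be explored with a small base R sketch for two parallel groups. The helper `pwr.par` is hypothetical and not PowerTOST's code; the group sizes of 9 + 9 and the CV of 0.5 are taken from the example above, and the ratio of 1.25 is added for comparison.

```r
# Sketch: power of the one-sided t-test in two parallel groups (log-scale).
# pwr.par() is a hypothetical helper, not a PowerTOST function.
pwr.par <- function(theta0, CV = 0.5, margin = 1, n = c(9, 9),
                    alpha = 0.025) {
  se  <- sqrt(log(CV^2 + 1)) * sqrt(sum(1 / n)) # SE of the log-difference
  df  <- sum(n) - 2
  ncp <- (log(theta0) - log(margin)) / se
  pt(qt(1 - alpha, df), df, ncp = ncp, lower.tail = FALSE)
}
theta0 <- c(1.25, 1.6, 2) # assumed T/R-ratios
data.frame(theta0 = theta0, power = signif(sapply(theta0, pwr.par), 4))
```

With a ratio of 1.6 instead of 2 the power drops to about one half, in line with the figure quoted above.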

Possibly the ‘Guy in the Armani suit’ has read about ‘allocation ratios’ in the COVID-19 vaccination trials and asks »Why should we treat as many patients with snake oil as with our wonder-drug?«

Let’s see.

round.up <- function(n, alloc) {
  return(as.integer(alloc * (n %/% alloc + as.logical(n %% alloc))))
}
CV      <- 0.50 # Total (pooled) CV
theta0  <- 2    # Assumed T/R-ratio
margin  <- 1    # Non-Inferiority margin
target  <- 0.8  # Target (desired) power
alloc.T <- 3    # Allocation of wonder-drug (T)
alloc.R <- 1    # Allocation of snake oil (R)
# conventional 1:1
tmp     <- sampleN.noninf(CV = CV, theta0 = theta0, margin = margin,
                          design = "parallel", targetpower = target,
                          print = FALSE)
n.0     <- as.integer(tmp[["Sample size"]])
pwr.0   <- tmp[["Achieved power"]]
# 3:1 allocation (naïve)
n.1     <- setNames(c(round.up(n.0 / (alloc.T + alloc.R) * alloc.T, alloc.T),
                      round.up(n.0 / (alloc.T + alloc.R) * alloc.R, alloc.R)),
                    c("Test", "Reference"))
pwr.1   <- power.noninf(CV = CV, theta0 = theta0, margin = margin,
                        n = n.1, design = "parallel")
# 3:1 allocation (preserving power)
n.2     <- n.1
repeat {# increase the sample size if necessary
  pwr.2 <- power.noninf(CV = CV, theta0 = theta0, margin = margin,
                        n = n.2, design = "parallel")
  if (pwr.2 >= target) break
  n.2[["Test"]]      <- as.integer(n.2[["Test"]] + alloc.T)
  n.2[["Reference"]] <- as.integer(n.2[["Reference"]] + alloc.R)
}
fmt <- paste0("%", nchar(as.character(n.0)), ".0f")
cat("\n++++++++++++ Non-inferiority test +++++++++++++",
    "\n            Sample size estimation",
    "\n-----------------------------------------------",
    "\nStudy design: 2 parallel groups",
    "\nlog-transformed data (multiplicative model)",
    "\n\nalpha = 0.025, target power =", target,
    "\nNon-inf. margin =", margin,
    paste0("\nTrue ratio = ", theta0, ", CV = ", CV),
    "\n\nTotal sample size =", n.0, "(1:1 allocation)",
    paste0("\n  n (T) = ", sprintf(fmt, n.0/2),
           ", n (R) = ", sprintf(fmt, n.0/2),
           ": power = ", signif(pwr.0, 6)),
    "\nTotal sample size =", sum(n.1),
    "(naïve", paste0(alloc.T, ":", alloc.R, " allocation)"),
    "penalty", sprintf("%.0f%%", 100*(sum(n.1)/n.0-1)),
    paste0("\n  n (T) = ", sprintf(fmt, n.1[["Test"]]),
           ", n (R) = ", sprintf(fmt, n.1[["Reference"]]),
           ": power = ", signif(pwr.1, 6)),
    "change", sprintf("%+.2f%%", 100 * (pwr.1 - pwr.0) / pwr.0),
    "\nTotal sample size =", sum(n.2),
    paste0("(", alloc.T, ":", alloc.R, " allocation)"),
    sprintf("%13s %.0f%%", "penalty", 100*(sum(n.2)/n.0-1)),
    paste0("\n  n (T) = ", sprintf(fmt, n.2[["Test"]]),
           ", n (R) = ", sprintf(fmt, n.2[["Reference"]]),
           ": power = ", signif(pwr.2, 6)),
    "change", sprintf("%+.2f%%", 100 * (pwr.2 - pwr.0) / pwr.0), "\n")
#
# ++++++++++++ Non-inferiority test +++++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 2 parallel groups
# log-transformed data (multiplicative model)
#
# alpha = 0.025, target power = 0.8
# Non-inf. margin = 1
# True ratio = 2, CV = 0.5
#
# Total sample size = 18 (1:1 allocation)
#   n (T) =  9, n (R) =  9: power = 0.831844
# Total sample size = 20 (naïve 3:1 allocation) penalty 11%
#   n (T) = 15, n (R) =  5: power = 0.766496 change -7.86%
# Total sample size = 24 (3:1 allocation)       penalty 33%
#   n (T) = 18, n (R) =  6: power = 0.844798 change +1.56%

Already in the naïve 3:1 allocation you have to round the sample size up because the 18 of the 1:1 allocation is not a multiple of 4. Nevertheless, you lose 7.86% power. In order to preserve power, you have to increase the sample size further.
However, it’s still based on a strong belief in the performance of the wonder-drug. If it again turns out to be just 60% better than snake oil, power with 24 subjects in the 3:1 allocation will be only ≈52%. Hardly better than tossing a coin.
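Why unequal allocation costs subjects can be seen from the standard error of the difference alone, which in two parallel groups is proportional to the square root of 1/n(T) + 1/n(R). The base R sketch below scans all splits of a fixed total; the total of 24 is taken from the example, everything else is illustrative.

```r
N   <- 24                               # total sample size of the 3:1 example
n.T <- 1:(N - 1)                        # all possible sizes of the Test group
fac <- sqrt(1 / n.T + 1 / (N - n.T))    # factor of the SE of the difference
n.T[which.min(fac)]                     # the 1:1 split minimizes the SE
fac[n.T == 18] / fac[n.T == 12]         # inflation of the SE by 3:1 allocation
all.equal(fac[n.T == 18], sqrt(2 / 9))  # same factor as 9 + 9 subjects
```

In other words, 18 + 6 subjects estimate the difference with the same precision as 9 + 9 subjects; the small gain in power comes only from the additional degrees of freedom.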

A special case: Bracketing approach (EMA)

Compare a new MR formulation (regimen once a day) with an IR formulation (twice a day). Cmax is the surrogate target metric for safety (Non-Superiority) and Cmin the surrogate for efficacy (Non-Inferiority):

[…] therapeutic studies might be waived [if …]:
• there is a well-defined therapeutic window in terms of safety and efficacy, the rate of input is known not to influence the safety and efficacy profile or the risk for tolerance development and
• bioequivalence between the reference and the test product is shown in terms of AUC(0-τ),ss and
• Cmax,ss for the new MR formulation is below or equivalent to the Cmax,ss for the approved formulation and Cmin,ss for the MR formulation is above or equivalent to the Cmin,ss of the approved formulation.
EMA (2014)3

Although not explicitly stated in the guideline, AFAIK the EMA expects tests at $$\small{\alpha=0.05}$$.

Margins are 1.25 for Cmax and 0.80 for Cmin. We assume CVs of 0.15 for AUC, 0.20 for Cmax, 0.35 for Cmin, T/R-ratios of 0.95 for AUC and Cmin and 1.05 for Cmax. We plan the study in a two-treatment, two-sequence, four-period full replicate design due to the high variability of Cmin.
Which PK metric leads the sample size in such a Bioequivalence (AUC) / Non-Superiority (Cmax) / Non-Inferiority (Cmin) study?

design <- "2x2x4"
x      <- data.frame(design = "2x2x4", metric = c("AUC", "Cmax", "Cmin"),
                     margin = c(NA, 1.25, 0.80), CV = c(0.15, 0.20, 0.35),
                     theta0 = c(0.95, 1.05, 0.95), n = NA_integer_,
                     power = NA_real_, stringsAsFactors = FALSE)
for (i in 1:nrow(x)) {
  if (x$metric[i] == "AUC") {# ABE
    x[i, 6:7] <- sampleN.TOST(design = design, theta0 = x$theta0[i],
                              CV      = x$CV[i], details = FALSE,
                              print   = FALSE)[7:8]
    if (x$n[i] < 12) {# minimum acc. to GLs
      x$n[i]     <- 12
      x$power[i] <- power.TOST(design  = design,
                               theta0  = x$theta0[i], CV = x$CV[i],
                               n       = x$n[i])
    }
  }else {                    # Non-Inferiority, Non-Superiority
    x[i, 6:7] <- sampleN.noninf(design = design, alpha = 0.05,
                                margin  = x$margin[i],
                                theta0  = x$theta0[i], CV = x$CV[i],
                                details = FALSE,
                                print   = FALSE)[6:7]
    if (x$n[i] < 12) {# minimum acc. to GLs
      x$n[i]     <- 12
      x$power[i] <- power.noninf(design = design, alpha = 0.05,
                                 margin  = x$margin[i],
                                 theta0  = x$theta0[i], CV = x$CV[i],
                                 n       = x$n[i])
    }
  }
}
x$power  <- signif(x$power, 4) # cosmetics
x$margin <- sprintf("%.2f", x$margin)
x$margin[x$margin == "NA"] <- "– "
print(x, row.names = FALSE)
cat(paste0("Sample size lead by ", x$metric[x$n == max(x$n)], ".\n"))
#  design metric margin   CV theta0  n  power
#   2x2x4    AUC    –   0.15   0.95 12 0.9881
#   2x2x4   Cmax   1.25 0.20   1.05 12 0.9098
#   2x2x4   Cmin   0.80 0.35   0.95 26 0.8184
# Sample size lead by Cmin.

However, with 26 subjects to show Non-Inferiority of Cmin the study is ‘overpowered’ for BE of AUC and Non-Superiority of Cmax.

cat("Power with", max(x$n), "subjects for",
    "\nAUC :",
    power.TOST(design = design, CV = x$CV[1],      # first row: AUC
               theta0 = x$theta0[1], n = max(x$n)),
    "\nCmax:",
    power.noninf(design = design, alpha = 0.05, margin = 1.25,
                 CV = x$CV[2], theta0 = x$theta0[2], # second row: Cmax
                 n = max(x$n)), "\n")
# Power with 26 subjects for
# AUC : 0.9999851
# Cmax: 0.9974663

That gives us some space to navigate for e.g., Cmax if values turn out to be ‘worse’ (say, CV 0.20 → 0.25, T/R-ratio 1.05 → 1.10):

# margin given as a number because x$margin was converted to text above
power.noninf(design = design, alpha = 0.05, margin = 1.25,
             CV = 0.25, theta0 = 1.10, n = max(x$n)) # higher CV, worse theta0
# [1] 0.8359967

The bracketing approach may require a lower sample size than required for demonstrating BE with the common CI-inclusion approach for all PK metrics, which is another option mentioned in the guideline.3 Note that reference-scaling by ABEL is acceptable for Cmax and Cmin if their CVwR > 30%, expanding the limits can be justified based on clinical grounds, and CVwR > 30% is not caused by ‘outliers’.3,12 How does that compare?

y <- data.frame(design = design, method = "ABE",
                metric = c("AUC", "Cmax", "Cmin"),
                CV = c(0.15, 0.20, 0.35), theta0 = c(0.95, 1.05, 0.90),
                L = 0.8, U = 1.25, n = NA_integer_, power = NA_real_,
                stringsAsFactors = FALSE)
for (i in 1:nrow(y)) {
  if (y$metric[i] == "AUC" || y$CV[i] <= 0.3) {
    y[i, 8:9] <- sampleN.TOST(CV = y$CV[i], theta0 = y$theta0[i],
                              design = design, print = FALSE,
                              details = FALSE)[7:8]
    if (y$n[i] < 12) {# minimum acc. to the GL
      y$n[i]     <- 12
      y$power[i] <- power.TOST(CV = y$CV[i], theta0 = y$theta0[i],
                               design = design, n = y$n[i])
    }
  }else {
    y$method[i] <- "ABEL"
    y[i, 6:7]   <- scABEL(CV = y$CV[i])
    y[i, 8:9]   <- sampleN.scABEL(CV = y$CV[i], theta0 = y$theta0[i],
                                  design = design, print = FALSE,
                                  details = FALSE)[8:9]
  }
}
y$L     <- sprintf("%.2f%%", 100 * y$L) # cosmetics
y$U     <- sprintf("%.2f%%", 100 * y$U)
y$power <- signif(y$power, 4)
names(y)[6:7] <- c("L ", "U ")
print(y, row.names = FALSE)
#  design method metric   CV theta0     L       U   n  power
#   2x2x4    ABE    AUC 0.15   0.95 80.00% 125.00% 12 0.9881
#   2x2x4    ABE   Cmax 0.20   1.05 80.00% 125.00% 12 0.9085
#   2x2x4   ABEL   Cmin 0.35   0.90 77.23% 129.48% 34 0.8118

Which approach is optimal is a case-by-case decision. Although in this example bracketing is the ‘winner’ (26 subjects instead of 34), it might be problematic if a CV is larger and/or a T/R-ratio worse than assumed: CV of AUC 0.15 → 0.20, Cmax 0.20 → 0.25, Cmin 0.35 → 0.50; T/R-ratio of AUC 0.95 → 0.90, Cmax 1.05 → 1.12, Cmin 0.90 → 0.88.

n <- max(y$n)
z <- data.frame(approach = c("ABE", "Non-Superiority", "ABE",
                             "Non-Inferiority", "ABE"),
                metric = c("AUC", rep(c("Cmax", "Cmin"), each = 2)),
                CV = c(0.2, rep(c(0.25, 0.50), each = 2)),
                theta0 = c(0.90, rep(c(1.12, 0.88), each = 2)),
                margin =  c(NA, 1.25, NA, 0.80, NA),
                L = c(0.80, NA, 0.80, NA, 0.80),
                U = c(1.25, NA, 1.25, NA, 1.25),
                n = n, power = NA_real_,
                stringsAsFactors = FALSE)
for (i in 1:nrow(z)) {
  if (z$approach[i] %in% c("Non-Superiority", "Non-Inferiority")) {
    z$power[i] <- power.noninf(design  = design,
                               alpha   = 0.05,
                               margin  = z$margin[i], theta0 = z$theta0[i],
                               CV      = z$CV[i], n = z$n[i])
  }else {
    if (z$CV[i] <= 0.3) {
      z$power[i]    <- power.TOST(design  = design,
                                  theta0  = z$theta0[i], CV = z$CV[i],
                                  n       = z$n[i])
    }else {
      z$approach[i] <- "ABEL"
      z[i, 6:7]     <- scABEL(CV = z$CV[i])
      z$power[i]    <- power.scABEL(design  = design,
                                    theta0  = z$theta0[i], CV = z$CV[i],
                                    n       = z$n[i])
    }
  }
}
z$L      <- sprintf("%.2f%%", 100 * z$L) # cosmetics
z$U      <- sprintf("%.2f%%", 100 * z$U)
z$power  <- signif(z$power, 4)
z$margin <- sprintf("%.2f", z$margin)
z$margin[z$margin == "NA"] <- "– "
z$L[z$L == "NA%"] <- "– "
z$U[z$U == "NA%"] <- "– "
names(z)[6:7] <- c("L ", "U ")
print(z, row.names = FALSE)
#         approach metric   CV theta0 margin     L       U   n  power
#              ABE    AUC 0.20   0.90     –  80.00% 125.00% 34 0.9640
#  Non-Superiority   Cmax 0.25   1.12   1.25     –       –  34 0.8258
#              ABE   Cmax 0.25   1.12     –  80.00% 125.00% 34 0.8258
#  Non-Inferiority   Cmin 0.50   0.88   0.80     –       –  34 0.3169
#             ABEL   Cmin 0.50   0.88     –  69.84% 143.19% 34 0.8183

• Non-Superiority / Non-Inferiority: We will pass Cmax (note that its power equals the one of ABE) but fail13 Cmin. If your software does not support one-sided tests (i.e., gives only a two-sided CI), use the upper reported CL of the two-sided $$\small{100\,(1-2\,\alpha)\%}$$ CI for Non-Superiority and its lower CL for Non-Inferiority.
• ABEL / ABE: Although we still have to assess Cmax by ABE (CVwR < 30%), it is not ‘overpowered’ any more. In reference-scaling by ABEL Cmin will still pass due to more expansion of the limits (69.84% – 143.19% for CVwR 50% instead of 77.23% – 129.48% for CVwR 35%).

Hence, in this case the equivalence approach by ABE(L) is the ‘winner’ because it tolerates more deviations from assumptions.

What happens if you fail to convince the agency that ABEL is acceptable? The picture changes.

a <- data.frame(design = design, method = "ABE",
                metric = c("AUC", "Cmax", "Cmin"),
                CV = c(0.15, 0.20, 0.35), theta0 = c(0.95, 1.05, 0.90),
                L = 0.8, U = 1.25, n = NA_integer_, power = NA_real_,
                stringsAsFactors = FALSE)
for (i in 1:nrow(a)) {
  a[i, 8:9] <- sampleN.TOST(CV = a$CV[i], theta0 = a$theta0[i],
                            design = design, print = FALSE,
                            details = FALSE)[7:8]
  if (a$n[i] < 12) {# minimum acc. to the GL
    a$n[i]     <- 12
    a$power[i] <- power.TOST(CV = a$CV[i], theta0 = a$theta0[i],
                             design = design, n = a$n[i])
  }
}
a$L     <- sprintf("%.2f%%", 100 * a$L) # cosmetics
a$U     <- sprintf("%.2f%%", 100 * a$U)
a$power <- signif(a$power, 4)
names(a)[6:7] <- c("L ", "U ")
print(a, row.names = FALSE)
#  design method metric   CV theta0     L       U   n  power
#   2x2x4    ABE    AUC 0.15   0.95 80.00% 125.00% 12 0.9881
#   2x2x4    ABE   Cmax 0.20   1.05 80.00% 125.00% 12 0.9085
#   2x2x4    ABE   Cmin 0.35   0.90 80.00% 125.00% 52 0.8003

Nasty – we need a ≈53% larger sample size. If all values turn out to be as bad as above:

b <- data.frame(design = design, method = "ABE",
                metric = c("AUC", "Cmax", "Cmin"),
                CV = c(0.20, 0.25, 0.50), theta0 = c(0.90, 1.12, 0.88),
                L = 0.8, U = 1.25, n = NA_integer_, power = NA_real_,
                stringsAsFactors = FALSE)
for (i in 1:nrow(b)) {
  b[i, 8:9] <- sampleN.TOST(CV = b$CV[i], theta0 = b$theta0[i],
                            design = design, print = FALSE,
                            details = FALSE)[7:8]
  if (b$n[i] < 12) {# minimum acc. to the GL
    b$n[i]     <- 12
    b$power[i] <- power.TOST(CV = b$CV[i], theta0 = b$theta0[i],
                             design = design, n = b$n[i])
  }
}
b$L     <- sprintf("%.2f%%", 100 * b$L) # cosmetics
b$U     <- sprintf("%.2f%%", 100 * b$U)
b$power <- signif(b$power, 4)
names(b)[6:7] <- c("L ", "U ")
print(b, row.names = FALSE)
#  design method metric   CV theta0     L       U    n  power
#   2x2x4    ABE    AUC 0.20   0.90 80.00% 125.00%  18 0.8007
#   2x2x4    ABE   Cmax 0.25   1.12 80.00% 125.00%  32 0.8050
#   2x2x4    ABE   Cmin 0.50   0.88 80.00% 125.00% 154 0.8038

End of the story. Recall that this is a study in a 2-treatment, 2-sequence, 4-period full replicate design.

c <- data.frame(design = "2x2x4",
                approach = c("ABE", "Non-Superiority", "Non-Inferiority"),
                metric = c("AUC", "Cmax", "Cmin"),
                margin = c(NA, 1.25, 0.80), CV = c(0.20, 0.25, 0.50),
                theta0 = c(0.90, 1.12, 0.88), n = NA_integer_,
                power = NA_real_, stringsAsFactors = FALSE)
for (i in 1:nrow(c)) {
  if (c$approach[i] == "ABE") {# ABE
    c[i, 7:8] <- sampleN.TOST(CV = c$CV[i], theta0 = c$theta0[i],
                              design = c$design[i], print = FALSE,
                              details = FALSE)[7:8]
    if (c$n[i] < 12) {# minimum acc. to the GL
      c$n[i]     <- 12
      c$power[i] <- power.TOST(CV = c$CV[i], theta0 = c$theta0[i],
                               design = c$design[i], n = c$n[i])
    }
  }else {                     # Non-Inferiority, Non-Superiority
    c[i, 7:8] <- sampleN.noninf(alpha = 0.05, CV = c$CV[i],
                                margin = c$margin[i], theta0 = c$theta0[i],
                                design = c$design[i], details = FALSE,
                                print = FALSE)[6:7]
    if (c$n[i] < 12) {# minimum acc. to GLs
      c$n[i]     <- 12
      c$power[i] <- power.noninf(alpha = 0.05, CV = c$CV[i],
                                 margin = c$margin[i], theta0 = c$theta0[i],
                                 design = c$design[i], n = c$n[i])
    }
  }
}
c$power <- signif(c$power, 4) # cosmetics
c$margin <- sprintf("%.2f", c$margin)
c$margin[c$margin == "NA"] <- "–  "
print(c, row.names = FALSE)
#  design        approach metric margin   CV theta0   n  power
#   2x2x4             ABE    AUC    –   0.20   0.90  18 0.8007
#   2x2x4 Non-Superiority   Cmax   1.25 0.25   1.12  32 0.8050
#   2x2x4 Non-Inferiority   Cmin   0.80 0.50   0.88 154 0.8038
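The expanded limits quoted in the ABEL comparisons above (77.23% – 129.48% for a CVwR of 35% and 69.84% – 143.19% for 50%) can be retraced with a few lines of base R. This is a sketch assuming the EMA's regulatory constant of 0.760, no expansion up to a CVwR of 30%, and a cap at 50%; the helper `abel.limits` is hypothetical and not the package's scABEL().

```r
# Sketch: expanded limits for ABEL according to the EMA's approach.
# abel.limits() is a hypothetical helper, not a PowerTOST function.
abel.limits <- function(CVwR, k = 0.760, CVswitch = 0.30, CVcap = 0.50) {
  if (CVwR <= CVswitch) return(c(lower = 0.80, upper = 1.25)) # conventional
  s.wR <- sqrt(log(min(CVwR, CVcap)^2 + 1)) # within-subject SD of R, capped
  c(lower = exp(-k * s.wR), upper = exp(+k * s.wR))
}
round(100 * abel.limits(0.35), 2) # 77.23 129.48
round(100 * abel.limits(0.50), 2) # 69.84 143.19
```

Beyond a CVwR of 50% the limits do not widen any further under these assumptions, so a still larger variability would no longer be compensated.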

As an aside, we would also require 154 subjects to demonstrate Non-Inferiority of Cmin in the bracketing approach. Perhaps it is actually more economical to opt for a clinical trial…

Helmut Schütz 2022
R, PowerTOST, and arsenal GPL 3.0, pandoc GPL 2.0.
1st version July 24, 2022. Rendered August 08, 2022 13:31 CEST by rmarkdown via pandoc in 0.75 seconds.

Footnotes and References

1. Labes D, Schütz H, Lang B. PowerTOST: Power and Sample Size for (Bio)Equivalence Studies. Package version 1.5.4. 2022-02-21. CRAN.↩︎

2. Labes D, Schütz H, Lang B. Package ‘PowerTOST’. February 21, 2022. CRAN.↩︎

3. EMA, CHMP. Guideline on the pharmacokinetic and clinical evaluation of modified release dosage forms. London. 20 November 2014. EMA/CPMP/EWP/280/96 Corr1. Online.↩︎

4. Zhang P. A Simple Formula for Sample Size Calculation in Equivalence Studies. J Biopharm Stat. 2003; 13(3): 529–38. doi:10.1081/BIP-120022772.↩︎

5. Senn S. Guernsey McPearson’s Drug Development Dictionary. April 21, 2020. Online.↩︎

6. Hoenig JM, Heisey DM. The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. Am Stat. 2001; 55(1): 19–24. doi:10.1198/000313001300339897. Open Access.↩︎

7. In short: There is no statistical method to ‘correct’ for unequal carryover. It can only be avoided by design, i.e., a sufficiently long washout between periods. According to the guidelines subjects with pre-dose concentrations > 5% of their Cmax can be excluded from the comparison if stated in the protocol.↩︎

8. Especially important for drugs which are auto-inducers or -inhibitors and biologics.↩︎

9. Senn S. Statistical Issues in Drug Development. Chichester: John Wiley; 2nd ed 2007.↩︎

10. It depends on both the within- and between-subject variances. In general the latter is larger than the former (see above).↩︎

11. ‘The Guy in the Armani suit’ (© ElMaestro, introduced there) is a running gag in the BEBA Forum. He (occasionally she) is only proficient in PowerPoint, copypasting from one document to another, and shouting »You are Fired!« if a study fails.↩︎

12. EMA, CHMP. Guideline on the Investigation of Bioequivalence. CPMP/EWP/QWP/1401/98 Rev. 1/ Corr. London. 20 January 2010. Online.↩︎

13. Any power < 50% is a failure by definition.↩︎