

Examples in this article were generated with R 4.3.1 by the package PowerTOST.1 See also its Online manual2 for details.

  • The right-hand badges give the respective section’s ‘level’.
    
  1. Basics about sample size methodology – requiring no or only limited statistical expertise.
    
  2. These sections are the most important ones. They are – hopefully – easily comprehensible even for novices.
    
  3. A somewhat higher knowledge of statistics and/or R is required. May be skipped or reserved for a later reading.
Abbreviation Meaning
\(\small{\alpha}\) Nominal level of the test, probability of Type I Error, patient’s risk
\(\small{\beta}\) Probability of Type II Error, producer’s risk
(A)BE (Average) Bioequivalence
ABEL Average Bioequivalence with Expanding Limits
CI, CL Confidence Interval, Limit
\(\small{CV}\) Coefficient of Variation
\(\small{CV_\textrm{inter}}\) Between-subject Coefficient of Variation
\(\small{CV_\textrm{intra}}\) Within-subject Coefficient of Variation
\(\small{CV_\textrm{wR}}\) Within-subject Coefficient of Variation of the Reference treatment
\(\small{\delta}\) Margin of clinical relevance in Non-Inferiority/Superiority and Non-Superiority
\(\small{H_0}\) Null hypothesis
\(\small{H_1}\) Alternative hypothesis (also \(\small{H_\textrm{a}}\))
L, U Lower and upper limits in ABE(L)
\(\small{\mu_\text{T},\,\mu_\text{R}}\) True mean of the Test and Reference treatment, respectively
\(\small{\pi}\) Prospective power (\(\small{1-\beta}\))
TOST Two One-Sided Tests

Introduction

    

What are the main statistical issues in planning a confirmatory experiment?

For details about inferential statistics and hypotheses see another article.

    

An ‘optimal’ study design is one that – taking all assumptions into account – has a reasonably high chance of demonstrating non-inferiority or non-superiority (power) whilst controlling the patient’s risk.

Contrary to Bioequivalence (BE), where a study is assessed with \(\small{\alpha=0.05}\) by TOST (or, equivalently, by a \(\small{100\,(1-2\times0.05)=90\%}\) Confidence Interval), in Non-Inferiority and Non-Superiority a single one-sided test with \(\small{\alpha=0.025}\) is employed.

Based on a ‘clinically relevant margin’ \(\small{\delta}\) we have different hypotheses.

Non-Inferiority/Superiority

We assume that higher responses are better.3 4 If data follow a lognormal distribution the hypotheses are \[H_0:\log_{e}\frac{\mu_\text{T}}{\mu_\text{R}}\leq \log_{e}\delta\;vs\;H_1:\log_{e}\frac{\mu_\text{T}}{\mu_\text{R}}>\log_{e}\delta\tag{1a}\]


Fig. 1 \(\small{\delta=0.8}\) (x-axis in log-scale).

If data follow a normal distribution the hypotheses are \[H_0:\mu_\text{T}-\mu_\text{R}\leq \delta\;vs\;H_1:\mu_\text{T}-\mu_\text{R}>\delta\tag{1b}\]

Applications:

  • Clinical phase III trials comparing a new treatment with placebo or an established treatment (efficacy).
  • Comparing minimum concentrations of a new MR formulation with those of an approved IR formulation as a surrogate of efficacy.5

Non-Superiority

We assume that lower responses are better. If data follow a lognormal distribution the hypotheses are \[H_0:\log_{e}\frac{\mu_\text{T}}{\mu_\text{R}}\geq \log_{e}\delta\;vs\;H_1:\log_{e}\frac{\mu_\text{T}}{\mu_\text{R}}<\log_{e}\delta\tag{2a}\]


Fig. 2 \(\small{\delta=1.25}\) (x-axis in log-scale).

If data follow a normal distribution the hypotheses are \[H_0:\mu_\text{T}-\mu_\text{R}\geq \delta\;vs\;H_1:\mu_\text{T}-\mu_\text{R}<\delta\tag{2b}\]

Applications:

  • Clinical phase III trials comparing Adverse Effects of a new treatment with placebo or an established treatment (safety).
  • Comparing maximum concentrations of a new MR formulation with those of an approved IR formulation as a surrogate of safety.5


Preliminaries

    

A basic knowledge of R is required. To run the scripts at least version 1.4.9 (2019-12-19) of PowerTOST is suggested. Any version of R would likely do, though the current release of PowerTOST was only tested with R version 4.1.3 (2022-03-10) and later.
All scripts were run on a Xeon E3-1245v3 @ 3.40GHz (1/4 cores) 16GB RAM with R 4.3.1 on Windows 7 build 7601, Service Pack 1, Universal C Runtime 10.0.10240.16390.

Note that in the functions sampleN.noninf() and power.noninf() the assumed coefficient of variation CV has to be given as a ratio and not in percent. If the analysis is based on lognormal data by \(\small{(1\text{a})}\) or \(\small{(2\text{a})}\), the assumed theta0 and the margin \(\small{\delta}\) (margin) have to be given as ratios and not in percent. If the analysis is based on normal data by \(\small{(1\text{b})}\) or \(\small{(2\text{b})}\), theta0 and margin have to be given as the original values. Data have to be continuous, following either a lognormal \(\small{\left(x\in\mathbb{R}^{+}=\{x:0<x<\infty\}\right)}\) or a normal \(\small{\left(x\in\mathbb{R}=\{x:-\infty<x<+\infty\}\right)}\) distribution.
Count data (e.g., events), rates (0 – 1) and percentages, as well as ordinal data (e.g., tmax) are not supported.
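
As a quick check of the argument scaling – a minimal sketch using the defaults and the result of the first example below (n = 36 for CV = 0.25):

library(PowerTOST)
power.noninf(CV = 0.25, n = 36) # CV as a ratio (0.25), not in percent (25)!
                                # defaults: alpha = 0.025, theta0 = 0.95,
                                # margin = 0.8, design = "2x2" -> ~0.82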

sampleN.noninf() gives balanced sequences for crossover designs (i.e., the same number of subjects is allocated to all sequences) or equal group sizes in a parallel design. Furthermore, the estimated sample size is the total number of subjects, not subjects per sequence or treatment arm – as in some other software packages. The sample size functions of PowerTOST use a modification of Zhang’s method6 based on the large sample approximation as the starting value of the iterations.

Most examples deal with studies where the response variables follow a lognormal distribution, i.e., we assume a multiplicative model (ratios instead of differences). We work with \(\small{\log_{e}}\)-transformed data in order to allow analysis by the t-test (requiring differences). This is the default in most functions of PowerTOST and hence, the argument logscale = TRUE does not need to be specified.
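
To see why the \(\small{\log_{e}}\)-transformation does the trick, here is a minimal sketch with simulated (entirely hypothetical) lognormal data in a parallel design – differences of logs back-transform to a ratio; in a crossover the same principle applies within the linear model:

set.seed(123456)                       # for reproducibility
resp.T <- rlnorm(24, meanlog = log(100 * 0.95), sdlog = 0.25) # hypothetical Test
resp.R <- rlnorm(24, meanlog = log(100), sdlog = 0.25)        # hypothetical Reference
res    <- t.test(log(resp.T), log(resp.R)) # the t-test requires differences
exp(res$estimate[1] - res$estimate[2])     # back-transformed: point estimate of T/R
exp(res$conf.int)                          # back-transformed: CI of the ratio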


Terminology

    

It may sound picky but ‘sample size calculation’ (as used in most guidelines and alas, in some publications and textbooks) is sloppy terminology. In order to get prospective power (and hence, a sample size), we need five values:

  1. The level of the test \(\small{\alpha}\) (in Non-Superiority / Non-Inferiority commonly 0.025),
  2. the clinically relevant margin \(\small{\delta}\),
  3. the desired (or target) power \(\small{\pi}\),
  4. the variance (commonly expressed as a coefficient of variation), and
  5. the deviation of the test from the reference treatment.

1 – 2 are fixed by the agency,
3 is set by the sponsor, and
4 – 5 are just (uncertain!) assumptions.

In other words, obtaining a sample size is not an exact calculation like \(\small{2\times2=4}\) but always just an estimation.

Power Calculation – A guess masquerading as mathematics.
Stephen Senn (2020)7
Realization: Observations (in a sample) of a random variable (of the population).

Of note, it is extremely unlikely that all assumptions will be exactly realized in a particular study. Hence, calculating retrospective (a.k.a. post hoc, a posteriori) power is not only futile but plain nonsense.8

Crossover studies are popular because generally the within-subject variability is lower than the between-subject variability. The efficiency of a crossover study compared to a parallel study is given by \(\small{\frac{\sigma_\textrm{intra}^2\;+\,\sigma_\textrm{inter}^2}{0.5\,\times\,\sigma_\textrm{intra}^2}}\). If, say, \(\small{\sigma_\textrm{intra}^2=0.5\times\sigma_\textrm{inter}^2}\), in a parallel study we need six times as many subjects as in a crossover to obtain the same power. On the other hand, in a crossover we have two measurements per subject, which makes the parallel study approximately three times more costly.
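
A two-line check of the six-fold factor stated above:

sigma2.inter <- 1                   # arbitrary unit
sigma2.intra <- 0.5 * sigma2.inter  # assumption from above
(sigma2.intra + sigma2.inter) / (0.5 * sigma2.intra) # relative sample size: 6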

Note that there is no relationship between \(\small{CV_\textrm{intra}}\) and \(\small{CV_\textrm{inter}}\). Examples are drugs subject to polymorphic metabolism; for these \(\small{CV_\textrm{intra}\ll CV_\textrm{inter}}\). On the other hand, some HVD(P)s show \(\small{CV_\textrm{intra}>CV_\textrm{inter}}\).

Carryover: A resi­dual effect of a previous period.

It is a prerequisite that no carryover from one period to the next exists. Only then will the comparison of treatments be unbiased. For details see another article.9 Subjects have to be in the same physiological state10 throughout the study – guaranteed by a sufficiently long washout phase. Crossover studies can be performed not only in healthy volunteers but also in patients with a stable disease (e.g., asthma). Studies in patients with an unstable disease (e.g., in oncology) must be performed in a parallel design.
If crossovers are not feasible (e.g., for drugs with a very long half-life), studies could be performed in a parallel design as well.


Power → Sample size

    

The sample size cannot be directly estimated,
only power calculated for an already given sample size.

The power equations cannot be re-arranged to solve for sample size.
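
Hence, the sample size has to be found iteratively by calculating power for increasing sample sizes until the target is reached. A simplified sketch of the idea (the package actually starts from Zhang’s large sample approximation mentioned above instead of the minimum sample size):

library(PowerTOST)
CV     <- 0.25 # assumed within-subject CV
target <- 0.80 # target power
n      <- 12   # start low (regulatory minimum)
repeat {       # step up by 2 to keep the 2x2 crossover balanced
  if (power.noninf(CV = CV, n = n) >= target) break
  n <- n + 2
}
c(n = n, power = power.noninf(CV = CV, n = n)) # reproduces the first example
                                               # below (n = 36)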

Power. That which statisticians are always calculating but never have.
Stephen Senn (2007)11


Examples

library(PowerTOST) # attach it to run the examples
    

Note that in Non-Inferiority and Non-Superiority – contrary to other functions of the package – a one-sided t-test (instead of TOST aiming at equivalence) is employed.
Throughout the examples I’m referring to studies in a single center – neither multiple groups within it nor multicenter studies. That’s another cup of tea.

Most methods of PowerTOST are based on pairwise comparisons. It is up to you to adjust the level of the test alpha if you want to compare more than two treatments (e.g., two test treatments vs one reference) in order to avoid inflation of the family-wise error rate due to multiplicity.
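
A minimal sketch – assuming two test treatments are compared with one reference and using the conservative Bonferroni adjustment (one option among others):

library(PowerTOST)
sampleN.noninf(alpha = 0.025 / 2, CV = 0.25) # halved alpha for two comparisons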

Say, we want to demonstrate Non-Inferiority in a 2×2×2 crossover design, assume a CV of 25%, a T/R-ratio of 0.95, a margin \(\small{\delta}\) of 0.8, and target a power of at least 0.80.
Since alpha = 0.025, theta0 = 0.95, margin = 0.8, targetpower = 0.8, design = "2x2", and logscale = TRUE are defaults of the function we don’t have to give them explicitly.
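
For the record, spelling out all defaults gives a call equivalent to the short one below:

sampleN.noninf(alpha = 0.025, CV = 0.25, theta0 = 0.95, margin = 0.8,
               targetpower = 0.8, design = "2x2", logscale = TRUE)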

sampleN.noninf(CV = 0.25)
# 
# ++++++++++++ Non-inferiority test +++++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover 
# log-transformed data (multiplicative model)
# 
# alpha = 0.025, target power = 0.8
# Non-inf. margin = 0.8
# True ratio = 0.95,  CV = 0.25
# 
# Sample size (total)
#  n     power
# 36   0.820330

If you want to perform the analysis with untransformed data, specify logscale = FALSE. Then the defaults are theta0 = -0.05 and margin = -0.2.

sampleN.noninf(CV = 0.25, logscale = FALSE)
# 
# ++++++++++++ Non-inferiority test +++++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover 
# untransformed data (additive model)
# 
# alpha = 0.025, target power = 0.8
# Non-inf. margin = -0.2
# True diff. = -0.05,  CV = 0.25
# 
# Sample size (total)
#  n     power
# 46   0.803507

Let’s return to lognormally distributed data because that’s more common.
Say, you have information from a pilot study that the treatment performs substantially (i.e., 30%) better than placebo. You are cautious (good idea!) and assume a lower T/R-ratio and a higher CV than the observed 1.30 and 0.25.

sampleN.noninf(CV = 0.28, theta0 = 1.25, margin = 1)
# 
# ++++++++++++ Non-inferiority test +++++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover 
# log-transformed data (multiplicative model)
# 
# alpha = 0.025, target power = 0.8
# Non-inf. margin = 1
# True ratio = 1.25,  CV = 0.28
# 
# Sample size (total)
#  n     power
# 26   0.802234

What about a parallel design? Likely the CV will be substantially higher.12

sampleN.noninf(CV = 0.50, theta0 = 1.25, margin = 1, design = "parallel")
# 
# ++++++++++++ Non-inferiority test +++++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 2 parallel groups 
# log-transformed data (multiplicative model)
# 
# alpha = 0.025, target power = 0.8
# Non-inf. margin = 1
# True ratio = 1.25,  CV = 0.5
# 
# Sample size (total)
#  n     power
# 144   0.803753

I hear the ‘Guy in the Armani suit’13 shouting »C’mon, 72 subjects / arm, who shall pay for that? Hey, we have the wonder-drug! It works twice as good as snake oil!«

sampleN.noninf(CV = 0.50, theta0 = 2, margin = 1, design = "parallel")
# 
# ++++++++++++ Non-inferiority test +++++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 2 parallel groups 
# log-transformed data (multiplicative model)
# 
# alpha = 0.025, target power = 0.8
# Non-inf. margin = 1
# True ratio = 2,  CV = 0.5
# 
# Sample size (total)
#  n     power
# 18   0.831844

Cross your fingers that the drug really performs that well. If it is actually just 60% better than snake oil, power with this sample size will be only ≈51%. Master of disaster…
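
Easily verified – ‘just 60% better’ translates to a T/R-ratio of 1.6:

power.noninf(CV = 0.50, theta0 = 1.6, margin = 1,
             n = 18, design = "parallel") # ~0.51 as stated above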

Possibly the ‘Guy in the Armani suit’ has read about ‘allocation ratios’ in the COVID-19 vaccination trials and asks »Why should we treat as many patients with snake oil as with our wonder-drug?«

    

Let’s see.

round.up <- function(n, alloc) {# round n up to the next multiple of alloc
  return(as.integer(alloc * (n %/% alloc + as.logical(n %% alloc))))
}
CV      <- 0.50 # Total (pooled) CV
theta0  <- 2    # Assumed T/R-ratio
margin  <- 1    # Non-Inferiority margin
target  <- 0.8  # Target (desired) power
alloc.T <- 3    # Allocation of wonder-drug (T)
alloc.R <- 1    # Allocation of snake oil (R)
# conventional 1:1
tmp     <- sampleN.noninf(CV = CV, theta0 = theta0, margin = margin,
                          design = "parallel", targetpower = target,
                          print = FALSE)
n.0     <- as.integer(tmp[["Sample size"]])
pwr.0   <- tmp[["Achieved power"]]
# 3:1 allocation (naïve)
n.1     <- setNames(c(round.up(n.0 / (alloc.T + alloc.R) * alloc.T, alloc.T),
                      round.up(n.0 / (alloc.T + alloc.R) * alloc.R, alloc.R)),
                    c("Test", "Reference"))
pwr.1   <- power.noninf(CV = CV, theta0 = theta0, margin = margin,
                        n = n.1, design = "parallel")
# 3:1 allocation (preserving power)
n.2     <- n.1
repeat {# increase the sample size if necessary
  pwr.2 <- power.noninf(CV = CV, theta0 = theta0, margin = margin,
                        n = n.2, design = "parallel")
  if (pwr.2 >= target) break
  n.2[["Test"]]      <- as.integer(n.2[["Test"]] + alloc.T)
  n.2[["Reference"]] <- as.integer(n.2[["Reference"]] + alloc.R)
}
fmt <- paste0("%", nchar(as.character(n.0)), ".0f")
cat("\n++++++++++++ Non-inferiority test +++++++++++++",
    "\n            Sample size estimation",
    "\n-----------------------------------------------",
    "\nStudy design: 2 parallel groups",
    "\nlog-transformed data (multiplicative model)",
    "\n\nalpha = 0.025, target power =", target,
    "\nNon-inf. margin =", margin,
    paste0("\nTrue ratio = ", theta0, ", CV = ", CV),
    "\n\nTotal sample size =", n.0, "(1:1 allocation)",
    paste0("\n  n (T) = ", sprintf(fmt, n.0/2),
    ", n (R) = ", sprintf(fmt, n.0/2),
    ": power = ", signif(pwr.0, 6)),
    "\nTotal sample size =", sum(n.1),
    "(naïve", paste0(alloc.T, ":", alloc.R, " allocation)"),
    "penalty", sprintf("%.0f%%", 100*(sum(n.1)/n.0-1)),
    paste0("\n  n (T) = ", sprintf(fmt, n.1[["Test"]]),
    ", n (R) = ", sprintf(fmt, n.1[["Reference"]]),
    ": power = ", signif(pwr.1, 6)),
    "change", sprintf("%+.2f%%", 100 * (pwr.1 - pwr.0) / pwr.0),
    "\nTotal sample size =", sum(n.2),
    paste0("(", alloc.T, ":", alloc.R, " allocation)"),
    sprintf("%13s %.0f%%", "penalty", 100*(sum(n.2)/n.0-1)),
    paste0("\n  n (T) = ", sprintf(fmt, n.2[["Test"]]),
    ", n (R) = ", sprintf(fmt, n.2[["Reference"]]),
    ": power = ", signif(pwr.2, 6)),
    "change", sprintf("%+.2f%%", 100 * (pwr.2 - pwr.0) / pwr.0), "\n")
# 
# ++++++++++++ Non-inferiority test +++++++++++++ 
#             Sample size estimation 
# ----------------------------------------------- 
# Study design: 2 parallel groups 
# log-transformed data (multiplicative model) 
# 
# alpha = 0.025, target power = 0.8 
# Non-inf. margin = 1 
# True ratio = 2, CV = 0.5 
# 
# Total sample size = 18 (1:1 allocation) 
#   n (T) =  9, n (R) =  9: power = 0.831844 
# Total sample size = 20 (naïve 3:1 allocation) penalty 11% 
#   n (T) = 15, n (R) =  5: power = 0.766496 change -7.86% 
# Total sample size = 24 (3:1 allocation)       penalty 33% 
#   n (T) = 18, n (R) =  6: power = 0.844798 change +1.56%

Already in the naïve 3:1 allocation you have to round the sample size up because the 18 of the 1:1 allocation is not a multiple of 4. Nevertheless, you lose 7.86% power. In order to preserve power, you have to increase the sample size further.
However, it’s still based on a strong belief in the performance of the wonder-drug. If it again turns out to be just 60% better than snake oil, power with 24 subjects in the 3:1 allocation will be only ≈52%. Hardly better than tossing a coin.
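
Again easily verified with the unbalanced sample size (18 on T, 6 on R):

power.noninf(CV = 0.50, theta0 = 1.6, margin = 1,
             n = c(18, 6), design = "parallel") # ~0.52 as stated above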


    

A special case: Bracketing approach (EMA)

Compare a new MR formulation (regimen once a day) with an IR formulation (twice a day). Cmax is the surrogate target metric for safety (Non-Superiority) and Cmin the surrogate for efficacy (Non-Inferiority):

[…] therapeutic studies might be waived [if …]:
  • there is a well-defined therapeutic window in terms of safety and efficacy, the rate of input is known not to influence the safety and efficacy profile or the risk for tolerance development and
    • bioequivalence between the reference and the test product is shown in terms of AUC(0-τ),ss and
    • Cmax,ss for the new MR formulation is below or equivalent to the Cmax,ss for the approved formulation and Cmin,ss for the MR formulation is above or equivalent to the Cmin,ss of the approved formulation.
EMA (2014)5

Although not explicitly stated in the guideline, AFAIK the EMA expects tests at \(\small{\alpha=0.05}\).

Margins are 1.25 for Cmax and 0.80 for Cmin. We assume CVs of 0.15 for AUC, 0.20 for Cmax, 0.35 for Cmin, T/R-ratios of 0.95 for AUC and Cmin and 1.05 for Cmax. We plan the study in a 2-treatment, 2-sequence, 4-period full replicate design due to the high variability of Cmin.
Which PK metric leads the sample size in such a Bioequivalence (AUC) / Non-Superiority (Cmax) / Non-Inferiority (Cmin) study?

design <- "2x2x4"
x      <- data.frame(design = "2x2x4", metric = c("AUC", "Cmax", "Cmin"),
                     margin = c(NA, 1.25, 0.80), CV = c(0.15, 0.20, 0.35),
                     theta0 = c(0.95, 1.05, 0.95), n = NA_integer_,
                     power = NA_real_, stringsAsFactors = FALSE)
for (i in 1:nrow(x)) {
  if (x$metric[i] == "AUC") {# ABE
    x[i, 6:7] <- sampleN.TOST(design  = design,
                              theta0  = x$theta0[i],
                              CV      = x$CV[i],
                              details = FALSE,
                              print   = FALSE)[7:8]
    if (x$n[i] < 12) {# minimum acc. to GLs
      x$n[i]     <- 12
      x$power[i] <- power.TOST(design  = design,
                              theta0  = x$theta0[i],
                              CV      = x$CV[i],
                              n       = x$n[i])
    }
}else {                # Non-Inferiority, Non-Superiority
    x[i, 6:7] <- sampleN.noninf(design  = design,
                                alpha   = 0.05,
                                margin  = x$margin[i],
                                theta0  = x$theta0[i],
                                CV      = x$CV[i],
                                details = FALSE,
                                print   = FALSE)[6:7]
    if (x$n[i] < 12) {# minimum acc. to GLs
      x$n[i]     <- 12
      x$power[i] <- power.noninf(design  = design,
                                alpha   = 0.05,
                                margin  = x$margin[i],
                                theta0  = x$theta0[i],
                                CV      = x$CV[i],
                                n       = x$n[i])
    }
  }
}
x$power  <- signif(x$power, 4) # cosmetics
x$margin <- sprintf("%.2f", x$margin)
x$margin[x$margin == "NA"] <- "–  "
print(x, row.names = FALSE)
cat(paste0("Sample size led by ", x$metric[x$n == max(x$n)], ".\n"))
#  design metric margin   CV theta0  n  power
#   2x2x4    AUC    –   0.15   0.95 12 0.9881
#   2x2x4   Cmax   1.25 0.20   1.05 12 0.9098
#   2x2x4   Cmin   0.80 0.35   0.95 26 0.8184
# Sample size led by Cmin.

However, with 26 subjects to show Non-Inferiority of Cmin the study is ‘overpowered’ for BE of AUC and Non-Superiority of Cmax.

cat("Power with", max(x$n), "subjects for",
    "\nAUC :",
    power.TOST(design = design, CV = x$CV[1],
               theta0 = x$theta0[1], n = max(x$n)),
    "\nCmax:",
    power.noninf(design = design, alpha = 0.05, margin = 1.25,
             CV = x$CV[2], theta0 = x$theta0[2], n = max(x$n)), "\n")
# Power with 26 subjects for 
# AUC : 0.9999851 
# Cmax: 0.9974663

That gives us some room to navigate for, e.g., Cmax if values turn out to be ‘worse’ (say, CV 0.20 → 0.25, T/R-ratio 1.05 → 1.10):

power.noninf(design = design, alpha = 0.05, margin = 1.25, # Cmax margin
             CV = 0.25, theta0 = 1.10, n = max(x$n)) # higher CV, worse theta0
# [1] 0.8359967

The bracketing approach may require a lower sample size than demonstrating BE for all PK metrics with the common CI-inclusion approach, which is another option mentioned in the guideline.5 Note that reference-scaling by ABEL is acceptable for Cmax and Cmin if their CVwR > 30%, expanding the limits can be justified on clinical grounds, and the CVwR > 30% is not caused by ‘outliers’.5 14 How does that compare?

y <- data.frame(design = design, method = "ABE",
                metric = c("AUC", "Cmax", "Cmin"),
                CV = c(0.15, 0.20, 0.35),
                theta0 = c(0.95, 1.05, 0.90),
                L = 0.8, U = 1.25, n = NA_integer_,
                power = NA_real_, stringsAsFactors = FALSE)
for (i in 1:nrow(y)) {
  if (y$metric[i] == "AUC" | y$CV[i] <= 0.3) {
    y[i, 8:9] <- sampleN.TOST(CV = y$CV[i], theta0 = y$theta0[i],
                              design = design, print = FALSE,
                              details = FALSE)[7:8]
    if (y$n[i] < 12) {# minimum acc. to the GL
      y$n[i]     <- 12
      y$power[i] <- power.TOST(CV = y$CV[i], theta0 = y$theta0[i],
                               design = design, n = y$n[i])
    }
}else {
    y$method[i] <- "ABEL"
    y[i, 6:7]   <- scABEL(CV = y$CV[i])
    y[i, 8:9]   <- sampleN.scABEL(CV = y$CV[i], theta0 = y$theta0[i],
                                  design = design, print = FALSE,
                                  details = FALSE)[8:9]
  }
}
y$L     <- sprintf("%.2f%%", 100 * y$L) # cosmetics
y$U     <- sprintf("%.2f%%", 100 * y$U)
y$power <- signif(y$power, 4)
names(y)[6:7] <- c("L   ", "U   ")
print(y, row.names = FALSE)
#  design method metric   CV theta0   L       U     n  power
#   2x2x4    ABE    AUC 0.15   0.95 80.00% 125.00% 12 0.9881
#   2x2x4    ABE   Cmax 0.20   1.05 80.00% 125.00% 12 0.9085
#   2x2x4   ABEL   Cmin 0.35   0.90 77.23% 129.48% 34 0.8118

Which approach is optimal is a case-by-case decision. Although in this example bracketing is the ‘winner’ (26 subjects instead of 34), it might be problematic if a CV is larger and/or a T/R-ratio worse than assumed: CV of AUC 0.15 → 0.20, Cmax 0.20 → 0.25, Cmin 0.35 → 0.50; T/R-ratio of AUC 0.95 → 0.90, Cmax 1.05 → 1.12, Cmin 0.90 → 0.88.

n <- max(y$n)
z <- data.frame(approach = c("ABE", "Non-Superiority", "ABE",
                             "Non-Inferiority", "ABE"),
                metric = c("AUC", rep(c("Cmax", "Cmin"), each = 2)),
                CV = c(0.2, rep(c(0.25, 0.50), each = 2)),
                theta0 = c(0.90, rep(c(1.12, 0.88), each = 2)),
                margin =  c(NA, 1.25, NA, 0.80, NA),
                L = c(0.80, NA, 0.80, NA, 0.80),
                U = c(1.25, NA, 1.25, NA, 1.25),
                n = n, power = NA_real_,
                stringsAsFactors = FALSE)
for (i in 1:nrow(z)) {
  if (z$approach[i] %in% c("Non-Superiority", "Non-Inferiority")) {
    z$power[i] <- power.noninf(design  = design,
                               alpha   = 0.05,
                               margin  = z$margin[i],
                               theta0  = z$theta0[i],
                               CV      = z$CV[i],
                               n       = z$n[i])
}else {
    if (z$CV[i] <= 0.3) {
      z$power[i]    <- power.TOST(design  = design,
                                  theta0  = z$theta0[i],
                                  CV      = z$CV[i],
                                  n       = z$n[i])
  }else {
      z$approach[i] <- "ABEL"
      z[i, 6:7]     <- scABEL(CV = z$CV[i])
      z$power[i]    <- power.scABEL(design  = design,
                                    theta0  = z$theta0[i],
                                    CV      = z$CV[i],
                                    n       = z$n[i])
    }
  }
}
z$L      <- sprintf("%.2f%%", 100 * z$L) # cosmetics
z$U      <- sprintf("%.2f%%", 100 * z$U)
z$power  <- signif(z$power, 4)
z$margin <- sprintf("%.2f", z$margin)
z$margin[z$margin == "NA"] <- "–  "
z$L[z$L == "NA%"] <- "–   "
z$U[z$U == "NA%"] <- "–   "
names(z)[6:7]     <- c("L   ", "U   ")
print(z, row.names = FALSE)
#         approach metric   CV theta0 margin   L       U     n  power
#              ABE    AUC 0.20   0.90    –   80.00% 125.00% 34 0.9640
#  Non-Superiority   Cmax 0.25   1.12   1.25   –       –    34 0.8258
#              ABE   Cmax 0.25   1.12    –   80.00% 125.00% 34 0.8258
#  Non-Inferiority   Cmin 0.50   0.88   0.80   –       –    34 0.3169
#             ABEL   Cmin 0.50   0.88    –   69.84% 143.19% 34 0.8183
  • Non-Superiority / Non-Inferiority
    We will pass Cmax (note that its power equals that of ABE) but fail15 Cmin.
    If your software does not support one-sided tests (i.e., gives only a two-sided CI), use the upper reported CL for Non-Superiority and the lower CL for Non-Inferiority (a minimal sketch follows after this list).
  • ABEL / ABE
    Although we still have to assess Cmax by ABE (CVwR < 30%), it is no longer ‘overpowered’.
    In reference-scaling by ABEL Cmin will still pass due to the wider expansion of the limits (69.84% – 143.19% for CVwR 50% instead of 77.23% – 129.48% for CVwR 35%).
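
A minimal sketch of assessing Non-Superiority via the two-sided CI mentioned in the first bullet, with a hypothetical outcome of the study (the point estimate 1.10 and CV 0.25 of Cmax are assumptions for illustration only):

library(PowerTOST)
CI <- CI.BE(alpha = 0.05, pe = 1.10, CV = 0.25,
            n = 34, design = "2x2x4") # two-sided 90% CI of the T/R-ratio
CI
CI[2] <= 1.25 # compare only the upper CL with the Non-Superiority margin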

Hence, in this case the equivalence approach by ABE(L) is the ‘winner’ because it tolerates more deviations from assumptions.

What happens if you fail to convince the agency that ABEL is acceptable? The picture changes.

a <- data.frame(design = design, method = "ABE",
                metric = c("AUC", "Cmax", "Cmin"),
                CV = c(0.15, 0.20, 0.35),
                theta0 = c(0.95, 1.05, 0.90),
                L = 0.8, U = 1.25, n = NA_integer_,
                power = NA_real_, stringsAsFactors = FALSE)
for (i in 1:nrow(a)) {
   a[i, 8:9] <- sampleN.TOST(CV = a$CV[i], theta0 = a$theta0[i],
                             design = design, print = FALSE,
                             details = FALSE)[7:8]
   if (a$n[i] < 12) {# minimum acc. to the GL
     a$n[i]     <- 12
     a$power[i] <- power.TOST(CV = a$CV[i], theta0 = a$theta0[i],
                              design = design, n = a$n[i])
   }
}
a$L     <- sprintf("%.2f%%", 100 * a$L) # cosmetics
a$U     <- sprintf("%.2f%%", 100 * a$U)
a$power <- signif(a$power, 4)
names(a)[6:7] <- c("L   ", "U   ")
print(a, row.names = FALSE)
#  design method metric   CV theta0   L       U     n  power
#   2x2x4    ABE    AUC 0.15   0.95 80.00% 125.00% 12 0.9881
#   2x2x4    ABE   Cmax 0.20   1.05 80.00% 125.00% 12 0.9085
#   2x2x4    ABE   Cmin 0.35   0.90 80.00% 125.00% 52 0.8003

Nasty – we need a ≈53% larger sample size.

If all values turn out to be as bad as assumed above:

b <- data.frame(design = design, method = "ABE",
                metric = c("AUC", "Cmax", "Cmin"),
                CV = c(0.20, 0.25, 0.50),
                theta0 = c(0.90, 1.12, 0.88),
                L = 0.8, U = 1.25, n = NA_integer_,
                power = NA_real_, stringsAsFactors = FALSE)
for (i in 1:nrow(b)) {
   b[i, 8:9] <- sampleN.TOST(CV = b$CV[i], theta0 = b$theta0[i],
                             design = design, print = FALSE,
                             details = FALSE)[7:8]
   if (b$n[i] < 12) {# minimum acc. to the GL
     b$n[i]     <- 12
     b$power[i] <- power.TOST(CV = b$CV[i], theta0 = b$theta0[i],
                              design = design, n = b$n[i])
   }
}
b$L     <- sprintf("%.2f%%", 100 * b$L) # cosmetics
b$U     <- sprintf("%.2f%%", 100 * b$U)
b$power <- signif(b$power, 4)
names(b)[6:7] <- c("L   ", "U   ")
print(b, row.names = FALSE)
#  design method metric   CV theta0   L       U      n  power
#   2x2x4    ABE    AUC 0.20   0.90 80.00% 125.00%  18 0.8007
#   2x2x4    ABE   Cmax 0.25   1.12 80.00% 125.00%  32 0.8050
#   2x2x4    ABE   Cmin 0.50   0.88 80.00% 125.00% 154 0.8038

End of the story. Recall that this is a study in a 2-treatment, 2-sequence, 4-period full replicate design.

c <- data.frame(design = "2x2x4",
                approach = c("ABE", "Non-Superiority", "Non-Inferiority"),
                metric = c("AUC", "Cmax", "Cmin"),
                margin = c(NA, 1.25, 0.80), CV = c(0.20, 0.25, 0.50),
                theta0 = c(0.90, 1.12, 0.88), n = NA_integer_,
                power = NA_real_, stringsAsFactors = FALSE)
for (i in 1:nrow(c)) {
  if (c$approach[i] == "ABE") {# ABE
    c[i, 7:8] <- sampleN.TOST(CV = c$CV[i], theta0 = c$theta0[i],
                              design = c$design[i], print = FALSE,
                              details = FALSE)[7:8]
    if (c$n[i] < 12) {# minimum acc. to the GL
      c$n[i]     <- 12
      c$power[i] <- power.TOST(CV = c$CV[i], theta0 = c$theta0[i],
                               design = c$design[i], n = c$n[i])
   }
}else {                     # Non-Inferiority, Non-Superiority
    c[i, 7:8] <- sampleN.noninf(alpha = 0.05, CV = c$CV[i],
                                margin = c$margin[i], theta0  = c$theta0[i],
                                design = c$design[i], details = FALSE,
                                print = FALSE)[6:7]
    if (c$n[i] < 12) {# minimum acc. to GLs
      c$n[i]     <- 12
      c$power[i] <- power.noninf(alpha = 0.05, CV = c$CV[i],
                                margin = c$margin[i], theta0  = c$theta0[i],
                                design = c$design[i], n = c$n[i])
    }
  }
}
c$power  <- signif(c$power, 4) # cosmetics
c$margin <- sprintf("%.2f", c$margin)
c$margin[c$margin == "NA"] <- "–  "
print(c, row.names = FALSE)
#  design        approach metric margin   CV theta0   n  power
#   2x2x4             ABE    AUC    –   0.20   0.90  18 0.8007
#   2x2x4 Non-Superiority   Cmax   1.25 0.25   1.12  32 0.8050
#   2x2x4 Non-Inferiority   Cmin   0.80 0.50   0.88 154 0.8038

As an aside, we would also require 154 subjects to demonstrate Non-Inferiority of Cmin in the bracketing approach. Perhaps it is more economical to opt for a clinical trial…



Q & A

  • Q: Can we use R in a regulated environment and is PowerTOST validated?
    A: See this document16 about the acceptability of Base R and its SDLC.17
R is updated every couple of months with documented changes18 and maintains a bug-tracking system.19 I recommend always using the latest release.
The authors of PowerTOST did their best to provide reliable and valid results. The package’s NEWS documents the development of the package, bug fixes, and the introduction of new methods. Issues are tracked at GitHub (as of today none is open). So far the package has had >102,000 downloads. Given such a large user base, it is extremely unlikely that relevant bugs have remained undetected.
The ultimate responsibility of validating any software (yes, of SAS as well…) lies in the hands of the user.20 21 22

  • Q: I still have questions. How to proceed?
    A: The preferred method is to register at the BEBA Forum and post your question in the category Power / Sample size  (please read the Forum’s Policy first).
    You can contact me at [email protected]. Be warned – I will charge you for anything beyond most basic questions.


Licenses

CC BY 4.0 Helmut Schütz 2023
R, PowerTOST, and arsenal GPL 3.0, klippy MIT, pandoc GPL 2.0.
1st version July 24, 2022. Rendered June 21, 2023 10:44 CEST by rmarkdown via pandoc in 0.37 seconds.

Footnotes and References


  1. Labes D, Schütz H, Lang B. PowerTOST: Power and Sample Size for (Bio)Equivalence Studies. Package version 1.5.4.9000. 2022-04-25. CRAN.↩︎

  2. Labes D, Schütz H, Lang B. Package ‘PowerTOST’. February 21, 2022. CRAN.↩︎

  3. Chow S-C, Shao J, Wang H. Sample Size Calculations in Clinical Research. New York: Marcel Dekker; 2003. Chapter 3.↩︎

  4. Julious SA. Sample Sizes for Clinical Trials. Boca Raton: CRC Press; 2010. Chapter 4.↩︎

  5. EMA, CHMP. Guideline on the pharmacokinetic and clinical evaluation of modified release dosage forms. London. 20 November 2014. EMA/CPMP/EWP/280/96 Corr1. Online.↩︎

  6. Zhang P. A Simple Formula for Sample Size Calculation in Equivalence Studies. J Biopharm Stat. 2003; 13(3): 529–38. doi:10.1081/BIP-120022772.↩︎

  7. Senn S. Guernsey McPearson’s Drug Development Dictionary. April 21, 2020. Online.↩︎

  8. Hoenig JM, Heisey DM. The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. Am Stat. 2001; 55(1): 19–24. doi:10.1198/000313001300339897. Open Access Open Access.↩︎

  9. In short: There is no statistical method to ‘correct’ for unequal carryover. It can only be avoided by design, i.e., a sufficiently long washout between periods. According to the guidelines subjects with pre-dose concentrations > 5% of their Cmax can be excluded from the comparison if stated in the protocol.↩︎

  10. Especially important for drugs which are auto-inducers or -inhibitors and biologics.↩︎

  11. Senn S. Statistical Issues in Drug Development. Chichester: John Wiley; 2nd ed 2007.↩︎

  12. It depends on both the within- and between-subject variances. In general the latter is larger than the former (see above).↩︎

  13. ‘The Guy in the Armani suit’ (© ElMaestro, introduced there) is a running gag in the BEBA Forum. He (occasionally she) is only proficient in PowerPoint, copy-pasting from one document to another, and shouting »You are Fired!« if a study fails.↩︎

  14. EMA, CHMP. Guideline on the Investigation of Bioequivalence. CPMP/EWP/QWP/1401/98 Rev. 1/ Corr. London. 20 January 2010. Online.↩︎

  15. Any power < 50% is a failure by definition.↩︎

  16. The R Foundation for Statistical Computing. A Guidance Document for the Use of R in Regulated Clinical Trial En­vi­ron­ments. Vienna. October 18, 2021. Online.↩︎

  17. The R Foundation for Statistical Computing. R: Software Development Life Cycle. A Description of R’s Development, Testing, Release and Maintenance Processes. Vienna. October 18, 2021. Online.↩︎

  18. The R Foundation. R News. 2023-06-16. Online.↩︎

  19. Bugzilla. R bug tracking system. Online.↩︎

  20. FDA. Statistical Software Clarifying Statement. May 6, 2015. Download.↩︎

  21. WHO. Guidance for organizations performing in vivo bioequivalence studies. Geneva. May 2016. Technical Report Series No. 996, Annex 9. Section 4. Online.↩︎

  22. ICH. Good Clinical Practice (GCP). E6(R3) – Draft. 19 May 2023. Section 4.5. Online.↩︎