
Examples in this article were generated with R 4.0.5 and the package PowerTOST.1

See also the README on GitHub for an overview and the online manual2 for details and a collection of other articles.

  • The right-hand badges give the ‘level’ of the respective section.
    
    1. Basics about sample size methodology – requiring no or only limited statistical expertise.
    2. These sections are the most important ones. They are – hopefully – easily comprehensible even for novices.
    3. A somewhat higher knowledge of statistics and/or R is required. May be skipped or reserved for later reading.
    4. An advanced knowledge of statistics and/or R is required. Not recommended for beginners in particular.
    5. If you are not a neRd or statistics aficionado, skipping is recommended. Suggested for experts but might be confusing for others.
Abbreviation  Meaning
BE            Bioequivalence
CV            (Within-subject) Coefficient of Variation
H0            Null hypothesis
H1            Alternative hypothesis (also Ha)
TOST          Two One-Sided Tests

Introduction


What is a significant treatment effect and do we have to care about one?

Sometimes regulatory assessors ask for a ‘justification’ of a significant treatment effect in an equivalence trial.

I will try to clarify why such a justification is futile and – a bit provocatively – why asking for one demonstrates a lack of understanding of the underlying statistical concepts.


All examples deal with the 2×2×2 Crossover (RT|TR) but are applicable to any kind of Crossover (Higher-Order, Replicate Designs) as well. A basic knowledge of R does not hurt.

Preliminaries


A basic knowledge of R is required. To run the scripts, at least version 1.4.3 (2016-11-01) of PowerTOST is suggested. Any version of R should do; however, the current release of PowerTOST was only tested with R version 3.6.3 (2020-02-29) and later.
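If in doubt, a quick check of the installed versions (a minimal sketch; install.packages() fetches PowerTOST from CRAN if it is missing):

if (!requireNamespace("PowerTOST", quietly = TRUE))
  install.packages("PowerTOST") # fetch from CRAN
packageVersion("PowerTOST")     # suggested: 1.4.3 or later
getRversion()                   # PowerTOST tested with 3.6.3 and later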

All examples deal with the 2×2×2 Crossover Design but are applicable to any kind of equivalence study.


Terminology


In order to get prospective power (and hence, a sample size), we need five values:

  1. The level of the test (in BE commonly 0.05),
  2. the BE-margins (commonly 0.80 – 1.25),
  3. the desired (or target) power,
  4. the variance (commonly expressed as a coefficient of variation), and
  5. the deviation of the test from the reference treatment.

1 – 2 are fixed by the agency,
3 is set by the sponsor (commonly to 0.80 – 0.90), and
4 – 5 are just (uncertain!) assumptions.

In other words, obtaining a sample size is not an exact calculation but always just an estimation.
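As a sketch of how these five values enter the estimation – the CV and T/R-ratio below are just example assumptions; the other arguments are spelled out although they are PowerTOST’s defaults:

library(PowerTOST)
sampleN.TOST(alpha       = 0.05, # 1. level of the test
             theta1      = 0.80, # 2. lower BE-margin
             theta2      = 1.25, #    upper BE-margin
             targetpower = 0.80, # 3. desired (target) power
             CV          = 0.30, # 4. assumed CV
             theta0      = 0.95) # 5. assumed T/R-ratio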

Realization: Observations (in a sample) of a random variable (of the population).

It is extremely unlikely that all assumptions will be exactly realized in a particular study. If the realized values differ from the assumptions (e.g., T deviating more from R, and/or a lower CV, and/or fewer dropouts than anticipated), the chance to observe a statistically significant treatment effect increases.
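A hypothetical illustration with CI.BE() (all numbers assumed for the sake of the example): with the planned assumptions the 90% CI includes 100%, whereas with a lower realized CV and a more deviating T/R-ratio it does not – the study still passes BE, but the treatment effect is ‘significant’.

library(PowerTOST)
CI.BE(CV = 0.15, pe = 0.95, n = 24) # as assumed: CI includes 100%
CI.BE(CV = 0.12, pe = 0.93, n = 24) # as realized: upper CL < 100%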


Model


Most agencies (like the EMA) require an ANOVA of \(\small{\log_{e}}\) transformed responses, i.e., a linear model where all effects are fixed. In R:

m <- lm(log(Y) ~ sequence + subject %in% sequence +
                 period + treatment, data = data)

Other agencies (FDA, Health Canada) require a mixed-effects model where sequence, period, and treatment are fixed effects and subject(sequence) is a random effect.3
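A rough analogue in R might look like the following sketch (assuming the nlme package and the same hypothetical data as above; as noted in the footnote, it does not reproduce the FDA’s SAS code exactly):

library(nlme)
m <- lme(log(Y) ~ sequence + period + treatment,
         random = ~ 1 | subject, data = data)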

Let us first recap the hypotheses in bioequivalence.

The ‘Two One-Sided Tests Procedure’ (TOST)4 \[\begin{matrix}\tag{1} H_\textrm{0L}:\frac{\mu_\textrm{T}}{\mu_\textrm{R}}\leq\theta_1\:vs\:H_\textrm{1L}:\frac{\mu_\textrm{T}}{\mu_\textrm{R}}>\theta_1\\ H_\textrm{0U}:\frac{\mu_\textrm{T}}{\mu_\textrm{R}}\geq\theta_2\:vs\:H_\textrm{1U}:\frac{\mu_\textrm{T}}{\mu_\textrm{R}}<\theta_2 \end{matrix}\]

The confidence interval inclusion approach \[H_0:\frac{\mu_\textrm{T}}{\mu_\textrm{R}}\notin\left[\theta_1,\theta_2\right]\:vs\:H_1:\theta_1<\frac{\mu_\textrm{T}}{\mu_\textrm{R}}<\theta_2\tag{2}\]

Note that the null-hypotheses imply bioinequivalence where \(\small{[\theta_1,\theta_2]}\) are the lower and upper limits of the bioequivalence range.
TOST provides two \(\small{p}\) values (\(\small{H_0}\) is not rejected if \(\small{\max[p_\textrm{L},p_\textrm{U}]>\alpha}\)) and is nowadays mainly of historical interest, since the CI inclusion approach is preferred in regulatory guidelines.

The limits \(\small{\left\{\theta_1,\theta_2\right\}}\) are based on the clinically relevant difference \(\small{\Delta}\), which is commonly set to 0.20. For NTIDs (EMA and other jurisdictions) \(\small{\Delta}\) is 0.10, and for Cmax (Russian Federation, EEU, GCC) \(\small{\Delta}\) is 0.25. \[\left\{\theta_1=100(1-\Delta),\,\theta_2=100(1-\Delta)^{-1}\right\}\tag{3}\] For the \(\small{\Delta}\)s mentioned above: \[\begin{matrix}\tag{4} \left\{\theta_1=80.00\%,\,\theta_2=125.00\%\right\}\\ \left\{\theta_1=90.00\%,\,\theta_2=111.1\dot{1}\%\right\}\\ \left\{\theta_1=75.00\%,\,\theta_2=133.3\dot{3}\%\right\} \end{matrix}\]
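A two-liner reproducing \(\small{(3)}\) and \(\small{(4)}\):

Delta <- c(0.20, 0.10, 0.25)
round(cbind(theta1 = 100 * (1 - Delta),
            theta2 = 100 / (1 - Delta)), 2)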

As long as the \(\small{100(1-2\,\alpha)}\) confidence interval lies entirely within the relevant pre-specified \(\small{\left\{\theta_1,\theta_2\right\}}\), the null-hypothesis is rejected and the alternative hypothesis of equivalence accepted \(\small{(2)}\). Neither the location of the point estimate \(\small{\theta_0}\) nor the width of the CI plays any role in this decision.

Since \(\small{(1)}\) and \(\small{(2)}\) are operationally equivalent, both \(\small{p}\) values of \(\small{(1)}\) are \(\small{<\alpha}\) exactly when the \(\small{100(1-2\,\alpha)}\) CI lies entirely within \(\small{\left\{\theta_1,\theta_2\right\}}\). A significant treatment effect is a different matter: it shows up whenever the \(\small{100(1-2\,\alpha)}\) CI does not include 100%.
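A hypothetical study illustrating both statements at once: its 90% CI lies entirely within 80–125% (both TOST \(\small{p}\) values < 0.05) and at the same time excludes 100%.

library(PowerTOST)
CI.BE(CV = 0.20, pe = 0.92, n = 48)           # within 80-125%, below 100%
suppressMessages(
  pvalues.TOST(CV = 0.20, pe = 0.92, n = 48)) # both p values < 0.05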


That means the treatments differ statistically significantly, although the difference is not clinically relevant.

Asking for a ‘justification’ of a statistically significant treatment difference contradicts the accepted principles laid down in guidelines since the 1980s.

With a sufficiently large sample size any treatment with \(\small{\theta_0\neq100\%}\) will show a statistically significant difference.5
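Footnote 5’s ‘stupid example’ can be reproduced in one line ("2x2x4" is PowerTOST’s design code for the 4-period full replicate):

library(PowerTOST)
CI.BE(CV = 0.10, pe = 0.985, n = 120,
      design = "2x2x4") # 90% CI excludes 100%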

Examples


Throughout the examples I’m dealing with studies in a 2×2×2 Crossover Design. Of course, the same logic is applicable to any other design as well.

library(PowerTOST) # attach it to run the examples
up2even <- function(n) {
  # round n up to the next even integer (two balanced sequences)
  return(as.integer(2 * (n %/% 2 + as.logical(n %% 2))))
}
nadj <- function(n, do) {
  # adjust the estimated sample size n for the anticipated dropout-rate do
  return(as.integer(up2even(n / (1 - do))))
}
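A quick check of the helpers (numbers only for illustration):

nadj(20, 0.15) # 20 subjects estimated, dropout-rate 0.15: dose 24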

Single endpoint


Say, the assumed CV was 0.15, the T/R-ratio 0.95, and we planned the study for power ≥ 0.90 with an anticipated dropout-rate of 0.15. More subjects than estimated were dosed (quite often the management decides that). In a small survey a whopping 37% of respondents reported this practice.6

CV     <- 0.15
target <- 0.90
do     <- 0.15
theta0 <- pe <- TR <- 0.95
n      <- sampleN.TOST(CV = CV, targetpower = target,
                       theta0 = theta0,
                       print = FALSE)[["Sample size"]]
n      <- nadj(n, do) # adjust for dropouts
ns     <- n:(n*2.5)   # range of sample sizes to explore
res1   <- data.frame(n = ns, CL.lo = NA, CL.hi = NA,
                     p.lo = NA, p.hi = NA,
                     post.hoc = NA)
# as planned
for (i in seq_along(ns)) {
  res1[i, 2:3] <- 100*CI.BE(CV = CV, pe = pe,
                            n = ns[i])
  res1[i, 4:5] <- suppressMessages(
                    pvalues.TOST(CV = CV, pe = pe,
                                 n = ns[i]))
  res1[i, 6]   <- suppressMessages(
                    power.TOST(CV = CV, n = ns[i],
                               theta0 = theta0))
}
dev.new(width = 4.5, height = 4.5) # portable; windows() works only on Windows
op <- par(no.readonly = TRUE)
par(mar = c(4.1, 4, 0, 0), cex.axis = 0.9)
plot(ns, rep(100, length(ns)), ylim = c(80, 125),
     type = "n", axes = FALSE, log = "y",
     xlab = "sample size",
     ylab = "T/R-ratio  (90% CI)")
axis(1, at = seq(n, tail(ns, 1), 6))
axis(2, at = c(80, 90, 100, 110, 120, 125), las = 1)
grid(nx = NA, ny = NULL)
abline(v = seq(n, tail(ns, 1), 6), col = "lightgray", lty = 3)
abline(h = c(80, 100, 125), lty = 2,
       col = c("red", "black", "red"))
abline(v = head(res1$n[res1$CL.hi < 100], 1), lty = 2)
segments(x0 = ns[1], x1 = tail(ns, 1),
         y0 = 100*TR, col = "blue")
lines(res1$n, res1$CL.lo, type = "s", col = "blue", lwd = 2)
lines(res1$n, res1$CL.hi, type = "s", col = "blue", lwd = 2)
box()
par(op)

Fig. 1 Realized values = assumptions.

We face a significant treatment effect with 48 subjects (upper confidence limit 99.98%) or more.

Similar to the above, but this time the management was more relaxed (only the planned 24 subjects were dosed).

theta0 <- 0.95
target <- 0.90
do     <- 0.15
TR     <- 0.92
n      <- sampleN.TOST(CV = CV, targetpower = target,
                       theta0 = theta0,
                       print = FALSE)[["Sample size"]]
n.adj  <- nadj(n, do) # adjust for dropouts
n.hi   <- up2even(n.adj * 1.2)
ns     <- n:n.hi
res2   <- data.frame(n = ns, CL.lo = NA, CL.hi = NA,
                     p.lo = NA, p.hi = NA,
                     post.hoc = NA)
for (i in seq_along(ns)) {
  res2[i, 2:3] <- 100*CI.BE(CV = CV, pe = TR,
                            n = ns[i])
  res2[i, 4:5] <- suppressMessages(
                    pvalues.TOST(CV = CV, pe = TR,
                                 n = ns[i]))
  res2[i, 6]   <- suppressMessages(
                    power.TOST(CV = CV, n = ns[i],
                               theta0 = TR))
}
dev.new(width = 4.5, height = 4.5) # portable; windows() works only on Windows
op <- par(no.readonly = TRUE)
par(mar = c(4.1, 4, 0, 0), cex.axis = 0.9)
plot(ns, rep(100, length(ns)), ylim = c(80, 125),
     type = "n", axes = FALSE, log = "y",
     xlab = "sample size",
     ylab = "T/R-ratio  (90% CI)")
axis(1, at = seq(n, tail(ns, 1), 2))
axis(2, at = c(80, 90, 100, 110, 120, 125), las = 1)
grid(nx = NA, ny = NULL)
abline(v = seq(n, tail(ns, 1), 2), col = "lightgray", lty = 3)
abline(h = c(80, 100, 125), lty = 2,
       col = c("red", "black", "red"))
abline(v = head(res2$n[res2$CL.hi < 100], 1), lty = 2)
segments(x0 = ns[1], x1 = tail(ns, 1),
         y0 = 100*TR, col = "blue")
lines(res2$n, res2$CL.lo, type = "s", col = "blue", lwd = 2)
lines(res2$n, res2$CL.hi, type = "s", col = "blue", lwd = 2)
box()
par(op)

Fig. 2 Worse T/R-ratio.

This time the T/R-ratio turned out to be worse (0.92 instead of the assumed 0.95). We face a significant treatment effect with 20 subjects (upper confidence limit 99.84%) or more.

We had to deal with a drug with low variability. The assumed CV was 0.10, the T/R-ratio 0.95, and we planned the study for power ≥ 0.80. Theoretically we would need only 8 (eight!) subjects, but the minimum sample size according to the guidelines is 12. We increased the sample size for an anticipated dropout-rate of 0.15.

CV     <- 0.10
target <- 0.80
theta0 <- 0.95
TR     <- 0.935
do     <- 0.15
n      <- sampleN.TOST(CV = CV, targetpower = target,
                       theta0 = theta0,
                       print = FALSE)[["Sample size"]]
if (n < 12) n <- 12   # acc. to GL
n.adj  <- nadj(n, do) # adjust for dropouts
ns     <- n:n.adj
res3   <- data.frame(n = ns, CL.lo = NA, CL.hi = NA,
                     p.lo = NA, p.hi = NA,
                     post.hoc = NA)
for (i in seq_along(ns)) {
  res3[i, 2:3] <- 100*CI.BE(CV = CV, pe = TR,
                            n = ns[i])
  res3[i, 4:5] <- suppressMessages(
                    pvalues.TOST(CV = CV, pe = TR,
                                 n = ns[i]))
  res3[i, 6]   <- suppressMessages(
                    power.TOST(CV = CV, n = ns[i],
                               theta0 = TR))
}
dev.new(width = 4.5, height = 4.5) # portable; windows() works only on Windows
op <- par(no.readonly = TRUE)
par(mar = c(4.1, 4, 0, 0), cex.axis = 0.9)
plot(ns, rep(100, length(ns)), ylim = c(80, 125),
     type = "n", axes = FALSE, log = "y",
     xlab = "sample size",
     ylab = "T/R-ratio  (90% CI)")
axis(1, at = ns)
axis(2, at = c(80, 90, 100, 110, 120, 125), las = 1)
grid(nx = NA, ny = NULL)
abline(v = ns, col = "lightgray", lty = 3)
abline(h = c(80, 100, 125), lty = 2,
       col = c("red", "black", "red"))
abline(v = head(res3$n[res3$CL.hi < 100], 1), lty = 2)
segments(x0 = ns[1], x1 = tail(ns, 1),
         y0 = 100*TR, col = "blue")
lines(res3$n, res3$CL.lo, type = "s", col = "blue", lwd = 2)
lines(res3$n, res3$CL.hi, type = "s", col = "blue", lwd = 2)
box()
par(op)

Fig. 3 Low CV, worse T/R-ratio.

The T/R-ratio turned out to be slightly worse (0.935 instead of the assumed 0.95). Already with 14 subjects we face a significant treatment effect (upper confidence limit 99.998%). Rounded to two decimal places this ‘nasty’ value shows up as 100.00% and the significance disappears from the report – but it remains in the output of the ANOVA.
Drugs with a low CV regularly show a significant treatment effect, since following the guidelines leads to ‘overpowered’ studies. Already with 12 subjects we have a post hoc power of 0.972 (though we planned only for 0.80).
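The quoted post hoc power is a one-liner:

library(PowerTOST)
power.TOST(CV = 0.10, theta0 = 0.935, n = 12) # ~0.972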


Multiple endpoint(s)


It is not unusual that equivalence of more than one endpoint has to be demonstrated. In bioequivalence the pharmacokinetic metrics Cmax and AUC0–t are mandatory (in some jurisdictions, like the FDA, additionally AUC0–∞).

We don’t have to worry about multiplicity issues (an inflated Type I Error): since all tests must pass at level \(\alpha\), we are protected by the intersection-union principle,7 8 as the sketch below illustrates.
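A quick check of the level of a single test: the Type I Error equals power at the BE-margin, and by the intersection-union principle the combined procedure cannot exceed the larger of the individual levels.

library(PowerTOST)
power.TOST(CV = 0.25, theta0 = 1.25, n = 28) # ~0.05, i.e., not inflated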

We design the study always for the worst case combination, i.e., based on the PK metric requiring the largest sample size.

Let us explore a simple example. The assumed CV of Cmax is 0.25 and the one of AUC is lower (say the variance ratio is 0.70). We assume a T/R-ratio of 0.95 for both, aiming at power ≥ 0.80. The anticipated dropout-rate is 0.10.
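The CV of AUC follows from the variance ratio on the log-scale via PowerTOST’s helpers (the script below does the same):

library(PowerTOST)
mse2CV(CV2mse(0.25) * 0.70) # CV of AUC: ~0.2082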

opt <- function(x) {
  # power at point estimate x minus the target power; uniroot()
  # below finds the x where power equals the target (res4, n.est,
  # and target are defined before opt() is called)
  suppressMessages(
    power.TOST(theta0 = x, CV = res4$CV[2],
               n = n.est)) - target
}
metrics <- c("Cmax", "AUC")
ratio   <- 0.70          # variance ratio (AUC / Cmax)
CV.Cmax <- 0.25          # CV of Cmax
CV.AUC  <- mse2CV(CV2mse(CV.Cmax)*ratio)
CV      <- signif(c(CV.Cmax, CV.AUC))
theta0  <- 0.95          # both metrics
target  <- 0.80          # target (desired) power
do.rate <- 0.10          # anticipated dropout-rate 10%
res4    <- data.frame(metric = metrics, theta0 = theta0,
                      CV = CV, n = NA, power1 = NA,
                      nadj = NA, power2 = NA)
# sample sizes for both metrics;
# the study's sample size is based on the one with the
# higher CV and adjusted for the dropout-rate
for (i in 1:nrow(res4)) {
  res4[i, 4:5] <- sampleN.TOST(CV = CV[i],
                               theta0 = theta0,
                               targetpower = target,
                               print = FALSE)[7:8]
}
res4$nadj   <- nadj(max(res4$n), do.rate)
for (i in 1:nrow(res4)) {
  res4[i, 7] <- power.TOST(CV = CV[i], theta0 = theta0,
                           n = res4$nadj[i])
}
res4[, c(5, 7)] <- signif(res4[, c(5, 7)], 4)
names(res4)[c(5, 7)] <- c("pwr (n)", "pwr (nadj)")
res5   <- data.frame(n = max(res4$nadj):max(res4$n),
                     PE = theta0, CL.hi1 = NA,
                     PE.lo = NA, CL.hi2 = NA)
for (i in 1:nrow(res5)) {
  n.est <- res5$n[i]
  if (theta0 < 1) {
    res5$PE.lo[i] <- uniroot(opt, tol = 1e-8,
                             interval =
                               c(0.80 + 1e-4,
                                 theta0))$root
  } else {
    res5$PE.lo[i] <- uniroot(opt, tol = 1e-8,
                             interval =
                               c(theta0,
                                 1.25 - 1e-4))$root
  }
  res5[i, 3] <- CI.BE(CV = CV[2], pe = theta0,
                      n = res5$n[i])[["upper"]]
  res5[i, 5] <- CI.BE(CV = CV[2],
                      pe = res5$PE.lo[i],
                      n = res5$n[i])[["upper"]]
}
res5[, 2:5] <- round(100*res5[, 2:5], 2)
names(res5)[c(3, 5)] <- rep("upper CL", 2)
print(res4, row.names = F);cat("AUC:\n");print(res5, row.names = F)
R>  metric theta0       CV  n pwr (n) nadj pwr (nadj)
R>    Cmax   0.95 0.250000 28  0.8074   32     0.8573
R>     AUC   0.95 0.208208 20  0.8057   32     0.9467
R> AUC:
R>   n PE upper CL PE.lo upper CL
R>  32 95   103.68 91.20    99.53
R>  31 95   103.83 91.41    99.91
R>  30 95   104.00 91.62   100.29
R>  29 95   104.17 91.85   100.71
R>  28 95   104.35 92.08   101.14

Due to its lower CV we would need only 20 subjects for AUC. However, for Cmax we need 28. We perform the study in 32 subjects (adjusted for the dropout-rate). Consequently, the study is ‘overpowered’ for AUC (~0.95 instead of ~0.81 with 20 subjects).

The supportive function opt() provides the extreme point estimates of AUC which would still give our target power (only the lower one is shown in the 4th column). If this value is realized in the study, its upper confidence limit will not include 100% unless we have at least two dropouts.

In a particular study the point estimate and/or the CV may be even lower whilst the study still passes. Then we will face a significant treatment effect even with more dropouts.

Such a situation is quite common and the further the CVs of PK metrics are apart, the more often we will face it.


Although the above is straightforward and based on elementary statistics, below is an R script to perform simulations.

Say, we assume a CV of Cmax of 0.25 and base the sample size on it (taking the dropout-rate into account), whilst the CV of AUC is lower but unknown. How often can we expect a significant treatment effect for a range of CVs (here from 0.25 down to 0.15)?

# Cave: Long runtime!
balance <- function(n, sequences) {
  # round n up to the next multiple of the number of sequences
  return(as.integer(sequences *
                    (n %/% sequences + as.logical(n %% sequences))))
}
adjust.dropouts <- function(n, do.rate) {
  # adjust the estimated sample size for the anticipated dropout-rate
  return(as.integer(balance(n / (1 - do.rate), sequences = 2)))
}
set.seed(123456)
nsims   <- 1e4L # number of simulations
target  <- 0.80 # target power
PE      <- 0.95 # assumed PE (both metrics)
CV.Cmax <- 0.25 # assumed CV of Cmax
CV.AUC  <- c(0.25, 0.20, 0.15)
do.rate <- 0.1  # anticipated dropout-rate (10%)
CV.do   <- 0.15 # assumed CV of the dropout-rate (15%)
tmp     <- sampleN.TOST(CV = CV.Cmax, theta0 = PE,
                        targetpower = target,
                        details = FALSE, print = FALSE)
n.des   <- tmp[["Sample size"]]
if (n.des >= 12) {
  power.Cmax <- tmp[["Achieved power"]]
} else { # GL!
  n.des <- 12
  power.Cmax <- power.TOST(CV = CV.Cmax, theta0 = PE, n = n.des)
}
power.AUC <- numeric()
for (j in seq_along(CV.AUC)) {
  power.AUC[j] <- power.TOST(CV = CV.AUC[j], theta0 = PE, n = n.des)
}
n.adj     <- adjust.dropouts(n = n.des, do.rate = do.rate)
res.Cmax  <- data.frame(CV = rep(NA, nsims), n = NA, PE = NA,
                        lower = NA, upper = NA, BE = FALSE,
                        signif = FALSE)
post.Cmax <- data.frame(sim = 1:nsims)
res.AUC   <- data.frame(CV.ass = rep(CV.AUC, each = nsims),
                        CV = rep(NA, nsims), n = NA, PE = NA,
                        lower = NA, upper = NA, BE = FALSE,
                        signif = FALSE)
post.AUC  <- data.frame(CV.ass = rep(CV.AUC, each = nsims),
                        sim = 1:(nsims * length(CV.AUC)))
for (j in 1:nsims) {
  do                <- rlnorm(1, meanlog = log(do.rate) - 0.5*CV2mse(CV.do),
                                 sdlog = sqrt(CV2mse(CV.do)))
  res.Cmax$n[j]     <- as.integer(round(n.des * (1 - do)))
  res.Cmax$CV[j]    <- mse2CV(CV2mse(CV.Cmax) *
                       rchisq(1, df = res.Cmax$n[j] - 2)/(res.Cmax$n[j] - 2))
  res.Cmax$PE[j]    <- exp(rnorm(1, mean = log(PE),
                                    sd = sqrt(0.5 / res.Cmax$n[j]) *
                                    sqrt(CV2mse(CV.Cmax))))
  res.Cmax[j, 4:5]  <- round(CI.BE(CV = res.Cmax$CV[j],
                                   pe = res.Cmax$PE[j],
                                   n = res.Cmax$n[j]), 4)
  if (res.Cmax$lower[j] >= 0.80 & res.Cmax$upper[j] <= 1.25) {
    res.Cmax$BE[j] <- TRUE
    if (res.Cmax$lower[j] > 1 | res.Cmax$upper[j] < 1)
      res.Cmax$signif[j] <- TRUE
  }
}
i <- 0
for (k in seq_along(CV.AUC)) {
  for (j in 1:nsims) {
    i <- i + 1
    res.AUC$n[i]     <- res.Cmax$n[j]
    res.AUC$CV[i]    <- mse2CV(CV2mse(CV.AUC[k]) *
                        rchisq(1, df = res.AUC$n[i] - 2)/(res.AUC$n[i] - 2))
    res.AUC$PE[i]    <- exp(rnorm(1, mean = log(PE),
                                     sd = sqrt(0.5 / res.AUC$n[i]) *
                                     sqrt(CV2mse(CV.AUC[k]))))
    res.AUC[i, 5:6]  <- round(CI.BE(CV = res.AUC$CV[i],
                                    pe = res.AUC$PE[i],
                                    n = res.AUC$n[i]), 4)
    if (res.AUC$lower[i] >= 0.80 & res.AUC$upper[i] <= 1.25) {
      res.AUC$BE[i] <- TRUE
      if (res.AUC$lower[i] > 1 | res.AUC$upper[i] < 1)
        res.AUC$signif[i] <- TRUE
    }
  }
}
passed.Cmax <- sum(res.Cmax$BE)
passed.AUC  <- numeric(length(CV.AUC))
for (j in seq_along(CV.AUC)) {
  passed.AUC[j] <- sum(res.AUC$BE[res.AUC$CV.ass == CV.AUC[j]])
}
txt <- paste("Assumed CV (Cmax)    :", sprintf("%.4f", CV.Cmax),
    "\nAssumed CVs (AUC)    :", paste(sprintf("%.4f", CV.AUC),
                                      collapse = ", "),
    "\nAssumed PE           :", sprintf("%.4f", PE),
    "\nTarget power         :", sprintf("%.4f", target),
    "\nSample size          :", n.des, "(based on Cmax)",
    "\nAchieved power (Cmax):", sprintf("%.4f", power.Cmax),
    "\nAchieved powers (AUC):", paste(sprintf("%.4f", power.AUC),
                                      collapse = ", "),
    "\nDosed                :", n.adj,
    sprintf("(anticip. dropout-rate %g)", do.rate),
    "\n ", formatC(nsims, format = "d", big.mark = ","),
    "simulated 2\u00D72\u00D72 studies",
    "\n  n:", min(res.Cmax$n), "\u2013", max(res.Cmax$n),
    sprintf("(median %g)", median(res.Cmax$n)),
    "\n  Cmax", sprintf("(%.4f)", CV.Cmax),
    "\n    CV       :",
    sprintf("%6.4f \u2013 %6.4f", min(res.Cmax$CV), max(res.Cmax$CV)),
    sprintf("(median %7.4f)", exp(median(log(res.Cmax$CV)))),
    "\n    PE       :",
    sprintf("%6.4f \u2013 %6.4f", min(res.Cmax$PE), max(res.Cmax$PE)),
    sprintf("(g. mean%7.4f)", exp(mean(log(res.Cmax$PE)))),
    "\n    100% not within CI (stat. significant):",
    sprintf("%5.2f%%", 100*sum(res.Cmax$signif)/passed.Cmax), "\n")
for (j in seq_along(CV.AUC)) {
  txt <- paste(txt, "  AUC", sprintf("(%.4f)", CV.AUC[j]),
    "\n    CV       :",
    sprintf("%6.4f \u2013 %6.4f",
      min(res.AUC$CV[res.AUC$CV.ass == CV.AUC[j]]),
      max(res.AUC$CV[res.AUC$CV.ass == CV.AUC[j]])),
    sprintf("(median %7.4f)",
      exp(median(log(res.AUC$CV[res.AUC$CV.ass == CV.AUC[j]])))),
    "\n    PE       :",
    sprintf("%6.4f \u2013 %6.4f",
      min(res.AUC$PE[res.AUC$CV.ass == CV.AUC[j]]),
      max(res.AUC$PE[res.AUC$CV.ass == CV.AUC[j]])),
    sprintf("(g. mean%7.4f)",
      exp(mean(log(res.AUC$PE[res.AUC$CV.ass == CV.AUC[j]])))),
    "\n    100% not within CI (stat. significant):",
    sprintf("%5.2f%%",
      100*sum(res.AUC$signif[res.AUC$CV.ass == CV.AUC[j]])/passed.AUC[j]),
    "\n")
}
cat(txt)
R> Assumed CV (Cmax)    : 0.2500 
R> Assumed CVs (AUC)    : 0.2500, 0.2000, 0.1500 
R> Assumed PE           : 0.9500 
R> Target power         : 0.8000 
R> Sample size          : 28 (based on Cmax) 
R> Achieved power (Cmax): 0.8074 
R> Achieved powers (AUC): 0.8074, 0.9349, 0.9946 
R> Dosed                : 32 (anticip. dropout-rate 0.1) 
R>   10,000 simulated 2×2×2 studies 
R>   n: 23 – 26 (median 25) 
R>   Cmax (0.2500) 
R>     CV       : 0.1199 – 0.4127 (median  0.2468) 
R>     PE       : 0.8314 – 1.0736 (g. mean 0.9502) 
R>     100% not within CI (stat. significant):  2.30% 
R>    AUC (0.2500) 
R>     CV       : 0.1248 – 0.4099 (median  0.2462) 
R>     PE       : 0.8373 – 1.0814 (g. mean 0.9491) 
R>     100% not within CI (stat. significant):  2.70% 
R>    AUC (0.2000) 
R>     CV       : 0.1008 – 0.3302 (median  0.1969) 
R>     PE       : 0.8506 – 1.0458 (g. mean 0.9501) 
R>     100% not within CI (stat. significant):  7.96% 
R>    AUC (0.1500) 
R>     CV       : 0.0831 – 0.2663 (median  0.1481) 
R>     PE       : 0.8722 – 1.0216 (g. mean 0.9497) 
R>     100% not within CI (stat. significant): 19.96%

What does that mean? You name it.


Conclusion

Coming back to the questions asked in the introduction. To repeat:

    What is a significant treatment effect … ?

  • It is a natural property of a test at level α. If a study passes, a significant treatment effect does not refute its conclusion (statistical significance ≠ clinical relevance).

    … and do we have to care about one?

  • If you are asked by a regulator for a ‘justification’, answer in a diplomatic way.


License

CC BY 4.0 Helmut Schütz 2021
1st version March 18, 2021.
Rendered 2021-04-08 11:09:48 CEST by rmarkdown in 0.51 seconds.

Footnotes and References


  1. Labes D, Schütz H, Lang B. PowerTOST: Power and Sample Size for (Bio)Equivalence Studies. 2021-01-18. CRAN.↩︎

  2. Labes D, Schütz H, Lang B. Package ‘PowerTOST’. January 18, 2021. CRAN.↩︎

  3. Unfortunately due to different ‘design philosophies’ the SAS-code given by the FDA cannot be translated to R.↩︎

  4. Schuirmann DJ. A Comparison of the Two One-Sided Tests Procedure and the Power Approach for Assessing the Equivalence of Average Bioavailability. J Pharmacokin Biopharm. 1987; 15(6): 657–80. doi:10.1007/BF01068419.↩︎

  5. Stupid example: CV = 10% (NTID), n = 120, 4-period full replicate design, \(\small{\theta_0=98.5\%}\) → 90% CI 97.03–99.99%, \(\small{p\approx5\cdot10^{-72}}\).↩︎

  6. Schütz H. Sample Size Estimation in Bioequivalence. Evaluation. 2020-10-23. BEBA Forum.↩︎

  7. Berger RL, Hsu JC. Bioequivalence Trials, Intersection-Union Tests and Equivalence Confidence Sets. Stat Sci. 1996; 11(4): 283–302. JSTOR:2246021.↩︎

  8. Zeng A. The TOST confidence intervals and the coverage probabilities with R simulation. March 14, 2014.↩︎