
Examples in this article were generated with R 4.0.5 and the packages PowerTOST1 and randomizeBE.2

See also the README on GitHub for an overview and the online manual3 for details and a collection of other articles.

  • The right-hand badges give the respective section’s ‘level’.
    
  1. Basics about power and sample size methodology – requiring no or only limited statistical expertise.
    
  2. These sections are the most important ones. They are – hopefully – easily comprehensible even for novices.
    
  3. A somewhat higher knowledge of statistics and/or R is required. May be skipped or reserved for a later reading.
    
  4. An advanced knowledge of statistics and/or R is required. Not recommended for beginners.
    
  5. If you are not a neRd or statistics aficionado, skipping is recommended. Suggested for experts but might be confusing for others.

Introduction

    

What is the purpose of calculating post hoc power?

Sometimes regulatory assessors ask for the ‘justification’ of ‘low’ or ‘high’ power in an equivalence trial.

I will try to clarify why such a justification is futile and – a bit provocatively – why asking for one demonstrates a lack of understanding of the underlying statistical concepts.

In bioinformatics – where thousands of tests are performed simultaneously – post hoc power is a suitable tool to select promising hypotheses.4

In the case of a single test (like in bioequivalence) it has been extensively criticized.5 6 7 It is only justified if a study failed and may help in designing the next one.


Preliminaries

    

A basic knowledge of R is required. To run the scripts at least version 1.4.3 (2016-11-01) of PowerTOST is suggested. Any version of R would likely do, though the current release of PowerTOST was only tested with version 3.6.3 (2020-02-29) and later. At least version 0.3.1 (2012-12-27) of randomizeBE is required.

library(PowerTOST)   # attach required ones
library(randomizeBE) # to run the examples
There is simple intuition behind results like these: If my car made it to the top of the hill, then it is powerful enough to climb that hill; if it didn’t, then it obviously isn’t powerful enough. Retrospective power is an obvious answer to a rather uninteresting question. A more meaningful question is to ask whether the car is powerful enough to climb a particular hill never climbed before; or whether a different car can climb that new hill. Such questions are prospective, not retrospective.


Background

    

In order to get prospective (a priori) power and hence, estimate a sample size, we need five values:

  1. The level of the test (in bioequivalence commonly α = 0.05),
  2. the BE-margins (commonly 80–125%),
  3. the desired (or target) power, where the producer’s risk β = 1 – power,
  4. the variance (commonly expressed as a coefficient of variation), and
  5. the deviation Δ of the test from the reference treatment.

1 – 2 are fixed by the agency,
3 is set by the sponsor (commonly to 80–90%), and
4 – 5 are – uncertain! – assumptions.
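
As a minimal sketch, these five values map directly to arguments of sampleN.TOST() of PowerTOST (the CV and T/R-ratio below are only illustrative assumptions):

library(PowerTOST)                  # if not already attached
sampleN.TOST(alpha       = 0.05,    # 1. level of the test
             theta1      = 0.80,    # 2. lower BE-margin
             theta2      = 1.25,    #    upper BE-margin
             targetpower = 0.90,    # 3. desired (target) power
             CV          = 0.25,    # 4. assumed coefficient of variation
             theta0      = 0.95,    # 5. assumed T/R-ratio (deviation Δ = 5%)
             design      = "2x2x2")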

The European Note for Guidance of 2001 was specific in this respect.8 Unfortunately, the current guideline reduced this section to the elastic clause »The number of subjects to be included in the study should be based on an appropriate sample size calculation«.9

However, all of that deals with the design of the study.

From a regulatory perspective the outcome of a comparative bioavailability study is dichotomous:
Either bioequivalence was demonstrated (the 90% confidence interval lies entirely within the pre-specified acceptance range) or not.

  • The patient’s risk is controlled (α ≤ 0.05).

  • Post hoc (a.k.a. a posteriori, retrospective) power and the producer’s risk β (1 – power) do not play any role in this decision, i.e., an agency should not even ask for it because β is independent of α.

Realization: Observations (in a sample) of a random variable (of the population).

Since it is extremely unlikely that all assumptions made in designing the study will be exactly realized, it is of limited value to assess post hoc power – unless the study failed and we want to design another one.
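
In that case the values observed in the failed study – not the original assumptions – should drive the planning of the next one. A minimal sketch (the observed CV and PE below are hypothetical):

library(PowerTOST)
CV.obs <- 0.36 # hypothetical CV observed in the failed study
pe.obs <- 0.90 # hypothetical observed point estimate
# sample size of the next study based on what was actually observed
sampleN.TOST(alpha = 0.05, CV = CV.obs, theta0 = pe.obs,
             theta1 = 0.80, theta2 = 1.25,
             targetpower = 0.80, design = "2x2x2")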

    

Let’s explore the properties of prospective power in an example. Assumed CV 0.25, T/R-ratio 0.95, targeted at 90% power in a 2×2×2 crossover design:

alpha  <- 0.05
CV     <- 0.25
theta0 <- 0.95
theta1 <- 0.80
theta2 <- 1/theta1
target <- 0.90
design <- "2x2x2"
n      <- sampleN.TOST(alpha = alpha, CV = CV,
                       theta0 = theta0,
                       theta1 = theta1, theta2 = theta2,
                       targetpower = target,
                       design = design,
                       print = FALSE)[["Sample size"]]
pe     <- sort(unique(c(seq(theta1, 1, length.out = 100),
                        1/seq(theta1, 1, length.out = 100))))
res1   <- data.frame(pe = pe, p.left = NA, p.right = NA,
                     power = NA)
for (i in 1:nrow(res1)) {
  res1[i, 2:3] <- pvalues.TOST(pe = pe[i],
                               theta1 = theta1,
                               theta2 = theta2,
                               CV = CV, n = n,
                               design = design)
  res1[i, 4]   <- power.TOST(alpha = alpha,
                             theta0 = pe[i],
                             theta1 = theta1,
                             theta2 = theta2,
                             CV = CV, n = n,
                             design = design)
}
clr    <- c("#FF0000A0", "#FF00FFA0", "#0000FFA0")
dev.new(width = 4.5, height = 4.5)
op     <- par(no.readonly = TRUE)
par(mar = c(4, 4, 0, 0))
plot(pe, res1$power, type = "n", ylim = c(0, 1), log = "x",
     axes = FALSE, xlab = "", ylab = "")
abline(v = c(theta0, 1/theta0),
       h = c(alpha, 0.5), lty = 2)
grid()
box()
axis(1, at = c(seq(theta1, 1.2, 0.1), theta2))
axis(1, at = c(theta1, theta0, 1/theta0, theta2),
     tick = FALSE, line = 2,
     labels = c(expression(theta[1]),
                expression(theta[0]),
                expression(1/theta[0]),
                expression(theta[2])))
axis(2, las = 1)
mtext(expression(italic(p)* " / power"), 2, line = 3)
lines(pe, res1$p.left, col = clr[1], lwd = 3)
lines(pe, res1$p.right, col = clr[2], lwd = 3)
lines(pe, res1$power, col = clr[3], lwd = 3)
legend("center", box.lty = 0, bg = "white", y.intersp = 1.2,
       legend = c(expression(italic(p)>=theta[1]),
                  expression(italic(p)<=theta[2]),
                  "power"), col = clr, lwd = 3, seg.len = 3)
par(op)

Fig. 1 Power curves for n = 38.

With the assumed \(\small{\theta_0}\) of 0.9500 (and its reciprocal of 1.0526) we achieve the target power. If \(\small{\theta_0}\) is closer to 1, we gain power. With \(\small{\theta_0}\) at the BE-margins \(\small{\left\{\theta_1,\theta_2\right\}}\), passing the study would be a false positive decision, whose probability is the Type I Error \(\small{\alpha=0.05}\). Note that at the margins the respective p-value of the TOST10 is 0.5 (like tossing a coin).
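
Both statements are easily verified (a small check, re-using the CV, sample size, and design from above):

library(PowerTOST)
CV <- 0.25; n <- 38; design <- "2x2x2"
# point estimate exactly at the lower margin: the p-value of that test is 0.5
pvalues.TOST(pe = 0.80, theta1 = 0.80, theta2 = 1.25,
             CV = CV, n = n, design = design)
# true ratio exactly at the lower margin: 'power' collapses to ~alpha,
# i.e., the probability of the Type I Error
power.TOST(alpha = 0.05, theta0 = 0.80, theta1 = 0.80, theta2 = 1.25,
           CV = CV, n = n, design = design)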

We can reverse this game and ask at which T/R-ratios power will be just 0.5 (with ratios deviating even more, trying to demonstrate BE is futile).

opt <- function(x) {
  power.TOST(theta0 = x, alpha = alpha,
             theta1 = theta1, theta2 = theta2,
             design = design,
             CV = CV, n = n) - 0.5
}
alpha  <- 0.05
CV     <- 0.25
theta0 <- 0.95
theta1 <- 0.80
theta2 <- 1/theta1
target <- 0.90
design <- "2x2x2"
n      <- sampleN.TOST(alpha = alpha, CV = CV,
                       theta0 = theta0,
                       theta1 = theta1, theta2 = theta2,
                       targetpower = target,
                       design = design,
                       print = FALSE)[["Sample size"]]
pe     <- sort(unique(c(seq(theta1, 1, length.out = 100),
                        1/seq(theta1, 1, length.out = 100))))
res1   <- data.frame(pe = pe, p.left = NA, p.right = NA,
                     power = NA)
for (i in 1:nrow(res1)) {
  res1[i, 2:3] <- pvalues.TOST(pe = pe[i],
                               theta1 = theta1,
                               theta2 = theta2,
                               CV = CV, n = n,
                               design = design)
  res1[i, 4]   <- power.TOST(alpha = alpha,
                             theta0 = pe[i],
                             theta1 = theta1,
                             theta2 = theta2,
                             CV = CV, n = n,
                             design = design)
}
pwr0.5 <- uniroot(opt, interval = c(theta1, theta0), tol = 1e-12)$root
clr    <- c("#FF0000A0", "#FF00FFA0", "#0000FFA0")
dev.new(width = 4.5, height = 4.5)
op     <- par(no.readonly = TRUE)
par(mar = c(4, 4, 0, 0))
plot(pe, res1$power, type = "n", ylim = c(0, 1), log = "x",
     axes = FALSE, xlab = "", ylab = "")
abline(v = c(pwr0.5, 1/pwr0.5),
       h = c(alpha, 0.5), lty = 2)
grid()
box()
axis(1, at = c(seq(theta1, 1.2, 0.1), theta2))
axis(1, at = c(theta1, pwr0.5, 1/pwr0.5, theta2),
     tick = FALSE, line = 2,
     labels = c(expression(theta[1]),
                expression(theta[0]),
                expression(1/theta[0]),
                expression(theta[2])))
axis(2, las = 1)
mtext(expression(italic(p)* " / power"), 2, line = 3)
lines(pe, res1$p.left, col = clr[1], lwd = 3)
lines(pe, res1$p.right, col = clr[2], lwd = 3)
lines(pe, res1$power, col = clr[3], lwd = 3)
legend("center", box.lty = 0, bg = "white", y.intersp = 1.2,
       legend = c(expression(italic(p)>=theta[1]),
                  expression(italic(p)<=theta[2]),
                  "power"), col = clr, lwd = 3, seg.len = 3)
par(op)

Fig. 2 Power curves for n = 38.

We see that the extreme T/R-ratios are 0.8795 and 1.1371. At these T/R-ratios the respective p-value of the TOST is 0.05 and the null hypothesis of inequivalence cannot be rejected.
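
This is easily confirmed (same CV, sample size, and design as above; small deviations from 0.5 and 0.05 are due to rounding the ratio to four significant digits):

library(PowerTOST)
CV <- 0.25; n <- 38; design <- "2x2x2"
# power at the 'extreme' T/R-ratio ...
power.TOST(alpha = 0.05, theta0 = 0.8795, theta1 = 0.80, theta2 = 1.25,
           CV = CV, n = n, design = design)
# ... and the corresponding p-values of the TOST
pvalues.TOST(pe = 0.8795, theta1 = 0.80, theta2 = 1.25,
             CV = CV, n = n, design = design)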


Misconceptions

    

As already stated above, post hoc power is not relevant in the assessment of BE. Consequently, it is not mentioned in any of the current global guidelines. Recently the WHO argued against it.

The a posteriori power of the study does not need to be calculated. The power of interest is that calculated before the study is conducted to ensure that the adequate sample size has been selected. […] The relevant power is the power to show equivalence within the pre-defined acceptance range.
WHO, 2020.11


‘Too low’ Power

    

Again: Either a study passed or it failed. If any of the assumptions in sample size planning (CV, T/R-ratio, anticipated dropout-rate) were not exactly realized in the study and turned out to be worse (e.g., higher CV, a T/R-ratio deviating more from 1, fewer eligible subjects), post hoc power will be lower than targeted. Or simply:

Being lucky is not a crime.
ElMaestro, 2014.12

Since \(\small{\alpha}\) and \(\small{\beta}\) are independent, this is not of a regulatory concern (the patient’s risk is not affected).

From the results of Example 2, Method B (Walter Hauck and Donald Schuirmann are pioneers of statistics in bioequivalence):

[…] we obtain a two-sided CI for the ratio (T/R) of geometric means of 88.45—116.38%, which meets the 80–125% acceptance criterion. We stop here and conclude BE, irrespective of the fact that we have not yet achieved the desired power of 80% (power = 66.3%).
Potvin et al., 2008.13
    

Let’s consider an example. Assumed CV 0.30, T/R-ratio 0.95, targeted at 80% power in a 2×2×2 crossover design:

alpha     <- 0.05
CV        <- 0.30
theta0    <- 0.95
theta1    <- 0.80
theta2    <- 1/theta1
target    <- 0.80
design    <- "2x2x2"
plan      <- data.frame(CV = CV, theta0 = theta0,
                        n = NA, power = NA)
plan[3:4] <- sampleN.TOST(alpha = alpha, CV = CV,
                          theta0 = theta0,
                          theta1 = theta1, theta2 = theta2,
                          targetpower = target,
                          design = design,
                          print = FALSE)[7:8]
print(signif(plan, 4), row.names = FALSE)
R>   CV theta0  n  power
R>  0.3   0.95 40 0.8158

So far, so good.

However, in the study the CV was higher (0.36), the PE more deviating (0.92), and we had two dropouts (38 eligible subjects).

CV        <- 0.36 # higher than assumed
pe        <- 0.92 # more deviating than assumed
n         <- 38   # 2 dropouts
actual    <- data.frame(CV = CV, PE = pe, n = n,
                        lower.CL = NA, upper.CL = NA,
                        BE = "fail", power = NA)
actual[4:5] <- sprintf("%.4f", CI.BE(alpha = alpha,
                                     CV = CV, pe = pe, n = n,
                                     design = design))
if (actual[4] >= theta1 & actual[5] <= theta2)
  actual$BE <- "pass"
actual[7] <- sprintf("%.4f", power.TOST(alpha = alpha,
                                        CV = CV,
                                        theta0 = pe,
                                        n = n,
                                        design = design))
print(actual, row.names = FALSE)
R>    CV   PE  n lower.CL upper.CL   BE  power
R>  0.36 0.92 38   0.8036   1.0532 pass 0.5094

The study passed, though it was a close shave. What does the post hoc power of 0.5094 tell us that we don’t already know? Nothing.


‘Forced’ Bioequivalence

    

This is a term sometimes used by regulatory assessors.
Stop searching: You will not find it in any publication, textbook, or guideline.

The idea behind it: the sponsor had such a large budget that the study was intentionally ‘overpowered’. What does that mean? Generally studies are powered to 80–90% (producer’s risk \(\small{\beta}\) 10–20%). If, say, the sponsor submits a protocol aiming at 99% power, it should have been rejected by the IEC / IRB. Keep in mind that it’s the job of the IEC / IRB to be concerned about the safety of subjects in the study. If members vote in favor of performing the study, all is good. Regrettably, their statistical expertise is finite, or it might just have slipped their attention…
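
To put a number on ‘overpowered’: a small sketch comparing the sample size for the usual 80% target with one aiming at 99% power (the CV of 0.30 and T/R-ratio of 0.95 are the illustrative assumptions used above):

library(PowerTOST)
# sample sizes for target powers of 80% vs an 'overpowered' 99%
sapply(c(0.80, 0.99), function(pwr)
  sampleN.TOST(alpha = 0.05, CV = 0.30, theta0 = 0.95,
               theta1 = 0.80, theta2 = 1.25,
               targetpower = pwr, design = "2x2x2",
               print = FALSE)[["Sample size"]])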

However, it might well be the opposite of what we discussed above. Say, the study was properly planned (prospective power 80%), but the realized CV was lower, the PE closer to 1, and the dropout-rate lower than anticipated. Naturally, post hoc power will be higher than targeted.

CV        <- 0.24 # lower than assumed
pe        <- 0.99 # less deviating than assumed
n         <- 43   # adjusted for the anticipated
                  # dropout-rate (44 dosed), 1 dropout
actual    <- data.frame(CV = CV, PE = pe, n = n,
                        lower.CL = NA, upper.CL = NA,
                        BE = "fail", power = NA)
actual[4:5] <- sprintf("%.4f",
                       CI.BE(alpha = alpha,
                             CV = CV, pe = pe, n = n,
                             design = design))
if (actual[4] >= theta1 & actual[5] <= theta2)
  actual$BE <- "pass"
actual[7] <- sprintf("%.4f", suppressMessages(
                               power.TOST(alpha = alpha,
                                          CV = CV,
                                          theta0 = pe,
                                          n = n,
                                          design = design)))
print(actual, row.names = FALSE)
R>    CV   PE  n lower.CL upper.CL   BE  power
R>  0.24 0.99 43   0.9085   1.0788 pass 0.9908

In either case it is not of a regulatory concern (the patient’s risk is not affected).

    

In BE the pharmacokinetic metrics Cmax and AUC0–t are mandatory (in some jurisdictions like the FDA additionally AUC0–∞). We don’t have to worry about multiplicity issues (inflated Type I Error, increased patient’s risk): since all tests must pass at level \(\small{\alpha}\), we are protected by the intersection-union principle.14 15

We always design studies for the worst-case combination, i.e., based on the PK metric requiring the largest sample size.

metrics <- c("Cmax", "AUCt", "AUCinf")
alpha   <- 0.05
CV      <- c(0.25, 0.19, 0.20)
theta0  <- c(0.95, 0.95, 0.95)
theta1  <- 0.80
theta2  <- 1 / theta1
target  <- 0.80
design  <- "2x2x2"
plan    <- data.frame(metric = metrics, CV = CV,
                      theta0 = theta0, n = NA, power = NA)
for (i in 1:nrow(plan)) {
  plan[i, 4:5] <- signif(
                    sampleN.TOST(alpha = alpha,
                                 CV = CV[i],
                                 theta0 = theta0[i],
                                 theta1 = theta1,
                                 theta2 = theta2,
                                 targetpower = target,
                                 design = design,
                                 print = FALSE)[7:8], 4)
}
txt     <- paste0("Sample size based on ",
                  plan$metric[plan$n == max(plan$n)], ".\n")
print(plan, row.names = FALSE); cat(txt)
R>  metric   CV theta0  n  power
R>    Cmax 0.25   0.95 28 0.8074
R>    AUCt 0.19   0.95 18 0.8294
R>  AUCinf 0.20   0.95 20 0.8347
R> Sample size based on Cmax.

We dosed more subjects to adjust for the anticipated dropout-rate of 10%. Let’s assume that all of our assumptions were exactly realized for Cmax and were slightly ‘better’ for the AUCs. We had two dropouts.

up2even <- function(n) {
  # round up to the next even number (balanced sequences)
  return(as.integer(2 * (n %/% 2 + as.logical(n %% 2))))
}
nadj <- function(n, do.rate) {
  # adjust the sample size for the anticipated dropout-rate
  return(as.integer(up2even(n / (1 - do.rate))))
}
metrics <- c("Cmax", "AUCt", "AUCinf")
alpha   <- 0.05
CV      <- c(0.25, 0.17, 0.18)
pe      <- c(0.95, 0.97, 0.96)
do.rate <- 0.1 # anticipated dropout-rate 10%
n.adj   <- nadj(plan$n[plan$n == max(plan$n)], do.rate)
n.elig  <- n.adj - 2 # two dropouts
theta1  <- 0.80
theta2  <- 1 / theta1
actual  <- data.frame(metric = metrics,
                      CV = CV, PE = pe, n = n.elig,
                      lower.CL = NA, upper.CL = NA,
                      BE = "fail", power = NA)
for (i in 1:nrow(actual)) {
  actual[i, 5:6] <- sprintf("%.4f", CI.BE(alpha = alpha,
                                          CV = CV[i],
                                          pe = pe[i],
                                          n = n.elig,
                                          design = design))
  if (actual[i, 5] >= theta1 & actual[i, 6] <= theta2)
    actual$BE[i] <- "pass"
  actual[i, 8] <- sprintf("%.4f", power.TOST(alpha = alpha,
                                             CV = CV[i],
                                             theta0 = pe[i],
                                             n = n.elig,
                                             design = design))
}
print(actual, row.names = FALSE)
R>  metric   CV   PE  n lower.CL upper.CL   BE  power
R>    Cmax 0.25 0.95 30   0.8526   1.0585 pass 0.8343
R>    AUCt 0.17 0.97 30   0.9007   1.0446 pass 0.9961
R>  AUCinf 0.18 0.96 30   0.8876   1.0383 pass 0.9865

Results like this are common in BE.

I have seen deficiency letters by regulatory assessors asking for a
»justification of too high power for AUC«.

Outright bizarre.

Another case is common in jurisdictions accepting reference-scaling only for Cmax (e.g., by ABEL). Then the sample size is driven by AUC.

metrics <- c("Cmax", "AUCt", "AUCinf")
alpha   <- 0.05
CV      <- c(0.45, 0.34, 0.36)
theta0  <- rep(0.90, 3)
theta1  <- 0.80
theta2  <- 1 / theta1
target  <- 0.80
design  <- "2x2x4"
plan    <- data.frame(metric = metrics,
                      method = c("ABEL", "ABE", "ABE"),
                      CV = CV, theta0 = theta0,
                      L = 100*theta1, U = 100*theta2,
                      n = NA, power = NA)
for (i in 1:nrow(plan)) {
  if (plan$method[i] == "ABEL") {
    plan[i, 5:6] <- round(100*scABEL(CV = CV[i]), 2)
    plan[i, 7:8] <- signif(
                      sampleN.scABEL(alpha = alpha,
                                     CV = CV[i],
                                     theta0 = theta0[i],
                                     theta1 = theta1,
                                     theta2 = theta2,
                                     targetpower = target,
                                     design = design,
                                     details = FALSE,
                                     print = FALSE)[8:9], 4)
  } else {
    plan[i, 7:8] <- signif(
                      sampleN.TOST(alpha = alpha,
                                   CV = CV[i],
                                   theta0 = theta0[i],
                                   theta1 = theta1,
                                   theta2 = theta2,
                                   targetpower = target,
                                   design = design,
                                   print = FALSE)[7:8], 4)
  }
}
txt <- paste0("Sample size based on ",
              plan$metric[plan$n == max(plan$n)], ".\n")
print(plan, row.names = FALSE); cat(txt)
R>  metric method   CV theta0     L      U  n  power
R>    Cmax   ABEL 0.45    0.9 72.15 138.59 28 0.8112
R>    AUCt    ABE 0.34    0.9 80.00 125.00 50 0.8055
R>  AUCinf    ABE 0.36    0.9 80.00 125.00 56 0.8077
R> Sample size based on AUCinf.

If the study is performed with 56 subjects and all assumed values are realized, post hoc power will be 0.9666 for Cmax.
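
This value can be reproduced with power.scABEL() of PowerTOST (it is based on simulations, so expect agreement only up to simulation error):

library(PowerTOST)
# post hoc power of Cmax (ABEL) if 56 subjects are dosed and the
# assumptions (CV 0.45, theta0 0.90, 4-period full replicate) are realized
power.scABEL(alpha = 0.05, CV = 0.45, theta0 = 0.90,
             n = 56, design = "2x2x4")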


Wrong Calculation

    

Power of the CI inclusion approach (or of the TOST) has nothing to do with the power of testing for a statistically significant difference of 20%. In the latter the PE is essentially ignored and power depends only on the CV and the sample size. Consequently, it will practically16 always be higher than the correct one.

alpha    <- 0.05
CV       <- 0.25
theta0   <- 0.95
theta1   <- 0.80
theta2   <- 1/theta1
target   <- 0.80
design   <- "2x2x2"
n        <- sampleN.TOST(alpha = alpha, CV = CV,
                         theta0 = theta0,
                         theta1 = theta1, theta2 = theta2,
                         targetpower = target, design = design,
                         print = FALSE)[["Sample size"]]
pe       <- 0.89
res      <- data.frame(PE = pe, lower = NA, upper = NA,
                       p.left = NA, p.right = NA,
                       power.TOST = NA, power.20 = NA)
res[2:3] <- CI.BE(alpha = alpha, pe = pe,
                  CV = CV, n = n, design = design)
res[4:5] <- pvalues.TOST(pe = pe, CV = CV, n = n,
                         design = design)
res[6]   <- power.TOST(alpha = alpha, theta0 = pe,
                       theta1 = theta1, theta2 = theta2,
                       CV = CV, n = n, design = design)
res[7]   <- power.TOST(alpha = alpha, theta0 = 1,
                       theta1 = theta1, theta2 = theta2,
                       CV = CV, n = n, design = design)
print(signif(res, 4), row.names = FALSE)
R>    PE  lower  upper  p.left   p.right power.TOST power.20
R>  0.89 0.7955 0.9957 0.05864 1.097e-05     0.4729   0.9023

In this example the study failed (although properly planned for an assumed T/R-ratio of 0.95) because the PE was only 89% and hence, the lower 90% confidence limit was just 79.55%. If you prefer the TOST, the study failed because \(\small{p(PE\geq\theta_1)}\) = 0.05864 > 0.05. As is to be expected for a failed study, post hoc power was < 50%.
A power of 90.23% is not even wrong. If it really were that high, why on earth did the study fail? Amazingly, such false results are commonly seen in study reports.

Power_TOST was introduced in Phoenix WinNonlin v6.4 in 2014. During development17 it was cross-validated against power.TOST(method = "nct").18 Power_80_20 (for the FDA’s power approach19) is still available for backwards compatibility. If you were guilty of using it in the past, revise your procedures. Or even better: Stop reporting post hoc power at all.
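
For the curious: the exact method and the approximation by the noncentral t-distribution can be compared directly in PowerTOST – a small sketch with arbitrary values:

library(PowerTOST)
# exact method (Owen's Q) vs approximation by the noncentral t-distribution
c(exact = power.TOST(CV = 0.25, theta0 = 0.95, n = 28,
                     design = "2x2x2", method = "exact"),
  nct   = power.TOST(CV = 0.25, theta0 = 0.95, n = 28,
                     design = "2x2x2", method = "nct"))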


Informative?

    

Post hoc power is an information sink. Say, we test in a pilot study three candidate formulations \(\small{A,B,C}\) against reference \(\small{D}\). We would select the candidate with \(\small{\min\left\{\left|\log_{e}\theta_\textrm{A}\right|,\left|\log_{e}\theta_\textrm{B}\right|,\left|\log_{e}\theta_\textrm{C}\right|\right\}}\) for the pivotal study because it is the most promising one. If two T/R-ratios (or their reciprocals) are similar, we would select the candidate with the lower CV.
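
In code the selection rule could look like this (the point estimates and CVs of the candidates are hypothetical):

# hypothetical T/R point estimates and CVs of candidates A, B, C
cand <- data.frame(candidate = c("A", "B", "C"),
                   PE = c(0.93, 1.06, 0.97),
                   CV = c(0.20, 0.22, 0.25))
# order by |log(PE)| (closest to 1 first); ties broken by the lower CV
cand <- cand[order(abs(log(cand$PE)), cand$CV), ]
cand[1, ] # most promising candidate for the pivotal study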

Let’s simulate a study in a 4×4 Williams’ design and evaluate it according to the ‘Two at a Time’ approach (for details see the article about Higher-Order Crossover Designs).
Cave: 257 LOC.

make.structure <- function(treatments = 3,
                           sequences = 6,
                           seqs, n, williams = TRUE,
                           setseed = TRUE) {
  # seq: vector of sequences (optional)
  # n  : number of subjects
  if (treatments < 3 | treatments > 5)
    stop("Only 3, 4, or 5 treatments implemented.")
  if (williams) {
    if (!sequences %in% c(4, 6, 10))
      stop("Only 4, 6, or 10 sequences in Williams\u2019 ",
           "designs implemented.")
    if (treatments == 3 & !sequences == 6)
      stop("6 sequences required for 3 treatments.")
    if (treatments == 4 & !sequences == 4)
      stop("4 sequences required for 4 treatments.")
    if (treatments == 5 & !sequences == 10)
      stop("10 sequences required for 5 treatments.")
  } else {
    if (!treatments == sequences)
      stop("Number of sequences have to equal ",
           "\nthe number of treatments in a Latin Square")
  }
  if (n %% sequences != 0) {
    msg <- paste0("\n", n, " is not a multiple of ",
                  sequences, ".",
                 "\nOnly balanced sequences implemented.")
    stop(msg)
  }
  # defaults if seqs not given
  if (is.null(seqs)) {
    if (williams) {
      if (setseed) {
        seed <- 1234567
      } else {
        seed <- runif(1, max = 1e7)
      }
      tmp      <- RL4(nsubj = n, seqs = williams(treatments),
                      blocksize = sequences, seed = seed,
                      randctrl = FALSE)$rl[, c(1, 3)]
      subj.per <- data.frame(subject = rep(1:n, each = treatments),
                             period = 1:treatments)
      data     <- merge(subj.per, tmp, by.x = c("subject"),
                        by.y = c("subject"), sort = FALSE)
    } else {
      if (treatments == 3) {
        seqs <- c("ABC", "BCA", "CAB")
      }
      if (treatments == 4) {
        seqs <- c("ABCD", "BCDA", "CDAB", "DABC")
      }
      if (treatments == 5) {
        seqs <- c("ABCDE", "BCDEA", "CDEAB", "DEABC", "EABCD")
      }
      # build the design structure for the default Latin Squares
      # (otherwise 'data' would be undefined further below)
      subject  <- rep(1:n, each = treatments)
      sequence <- rep(seqs, each = n / length(seqs) * treatments)
      data     <- data.frame(subject  = subject,
                             period   = 1:treatments,
                             sequence = sequence)
    }
  } else {
    if (!length(seqs) == sequences)
      stop("number of elements of 'seqs' have to equal 'sequences'.")
    if (!typeof(seqs) == "character")
      stop("'seqs' has to be a character vector.")
    subject  <- rep(1:n, each = treatments)
    sequence <- rep(seqs, each = n / length(seqs) * treatments)
    data     <- data.frame(subject  = subject,
                           period   = 1:treatments,
                           sequence = sequence)
  }
  for (i in 1:nrow(data)) {
    data$treatment[i] <- strsplit(data$sequence[i],
                                  "")[[1]][data$period[i]]
  }
  return(data)
}

sim.HOXover <- function(alpha = 0.05, treatments = 3, sequences = 6,
                        williams = TRUE, T, R, theta0,
                        theta1 = 0.80, theta2 = 1.25,
                        CV, n, seqs = NULL, nested = FALSE,
                        show.data = FALSE, details = FALSE,
                        power = FALSE, setseed = TRUE) {
  # simulate Higher-Order Crossover design (Williams' or Latin Squares)
  # T: vector of T-codes (character, e.g., c("A", "C")
  # R: vector of R-codes (character, e.g., c("B", "D")
  # theta0: vector of T/R-ratios
  # CV: vector of CVs of the T/R-comparisons
  if (treatments < 3)
    stop("At least 3 treatments required.")
  if (treatments > 5)
    stop("> 5 treatments not implemented.")
  comp <- paste0(T, R)
  comparisons <- length(comp)
  if (treatments == 3 & comparisons > 2)
    stop("Only 2 comparisons for 3 treatments implemented.")
  if (!length(theta0) == comparisons)
    stop("theta0 has to be a vector with ", comparisons,
         " elements.")
  if (!length(CV) == comparisons)
    stop("CV has to be a vector with ", comparisons,
         " elements.")
  # vector of ordered treatment codes
  trts <- c(T, R)[order(c(T, R))]
  heading.aovIII <- c(paste("Type III ANOVA:",
                            "Two at a Time approach"),
                             "sequence vs")
  # data.frame with columns subject, period, sequence, treatment
  data <- make.structure(treatments = treatments,
                         sequences = sequences,
                         seq = NULL, n = n,
                         williams = williams, setseed)
  if (setseed) set.seed(1234567)
  sd   <- CV2se(CV)
  Tsim <- data.frame(subject = 1:n,
                     treatment = rep(T, each = n),
                     PK = NA)
  cuts <- seq(n, nrow(Tsim), n)
  j    <- 1
  for (i in seq_along(cuts)) {
    Tsim$PK[j:cuts[i]] <- exp(rnorm(n = n,
                              mean = log(theta0[i]),
                              sd = sd[i]))
    j <- j + n
  }
  Rsim <- data.frame(subject = 1:n,
                     treatment = rep(R, each = n),
                     PK = NA)
  cuts <- seq(n, nrow(Rsim), n)
  j    <- 1
  for (i in seq_along(cuts)) {
    Rsim$PK[j:cuts[i]] <- exp(rnorm(n = n, mean = 0,
                                    sd = sd[i]))
    j <- j + n
  }
  TRsim      <- rbind(Tsim, Rsim)
  data       <- merge(data, TRsim,
                      by.x = c("subject", "treatment"),
                      by.y = c("subject", "treatment"),
                      sort = FALSE)
  cols       <- c("subject", "period",
                  "sequence", "treatment")
  data[cols] <- lapply(data[cols], factor)
  seqs  <- unique(data$sequence)
  if (show.data) print(data, row.names = FALSE)
  # named vectors of geometric means
  per.mean <- setNames(rep(NA, treatments), 1:treatments)
  trt.mean <- setNames(rep(NA, treatments), trts)
  seq.mean <- setNames(rep(NA, length(seqs)), seqs)
  for (i in 1:treatments) {
    per.mean[i] <- exp(mean(log(data$PK[data$period == i])))
    trt.mean[i] <- exp(mean(log(data$PK[data$treatment == trts[i]])))
  }
  for (i in 1:sequences) {
    seq.mean[i] <- exp(mean(log(data$PK[data$sequence == seqs[i]])))
  }
  txt <- "Geometric means of PK response\n  period"
  for (i in seq_along(per.mean)) {
    txt <- paste0(txt, "\n    ", i,
                  paste(sprintf(": %.4f", per.mean[i])))
  }
  txt <- paste(txt, "\n  sequence")
  for (i in seq_along(seq.mean)) {
    txt <- paste0(txt, "\n    ", seqs[i],
                  paste(sprintf(": %.4f", seq.mean[i])))
  }
  txt <- paste(txt, "\n  treatment")
  for (i in seq_along(trt.mean)) {
    txt <- paste0(txt, "\n    ", trts[i],
                  paste(sprintf(": %.4f", trt.mean[i])))
  }
  if (details) cat(txt, "\n")
  for (i in 1:comparisons) { # IBD comparisons
    comp.trt      <- strsplit(comp[i], "")[[1]]
    TaT           <- data[data$treatment %in% comp.trt, ]
    TaT$treatment <- as.character(TaT$treatment)
    TaT$treatment[TaT$treatment == comp.trt[1]] <- "T"
    TaT$treatment[TaT$treatment == comp.trt[2]] <- "R"
    TaT$treatment <- as.factor(TaT$treatment)
    if (nested) { # bogus
      model <- lm(log(PK) ~ sequence + subject%in%sequence +
                            period + treatment,
                            data = TaT)
    } else {      # simple for unique subject codes
      model <- lm(log(PK) ~ sequence + subject +
                            period + treatment,
                            data = TaT)
    }
    TR.est <- exp(coef(model)[["treatmentT"]])
    TR.CI  <- as.numeric(exp(confint(model, "treatmentT",
                                     level = 1 - 2 * alpha)))
    m.form <- toString(model$call)
    m.form <- paste0(substr(m.form, 5, nchar(m.form)),
                     " (", paste(comp.trt, collapse=""), ")")
    aovIII <- anova(model)
    if (nested) {
      MSdenom <- aovIII["sequence:subject", "Mean Sq"]
      df2     <- aovIII["sequence:subject", "Df"]
    } else {
      MSdenom <- aovIII["subject", "Mean Sq"]
      df2     <- aovIII["subject", "Df"]
    }
    fvalue <- aovIII["sequence", "Mean Sq"] / MSdenom
    df1    <- aovIII["sequence", "Df"]
    aovIII["sequence", 4] <- fvalue
    aovIII["sequence", 5] <- pf(fvalue, df1, df2,
                                lower.tail = FALSE)
    attr(aovIII, "heading")[1]   <- m.form
    attr(aovIII, "heading")[2:3] <- heading.aovIII
    if (nested) {
      attr(aovIII, "heading")[3] <- paste(heading.aovIII[2],
                                          "sequence:subject")
    } else {
      attr(aovIII, "heading")[3] <- paste(heading.aovIII[2],
                                          "subject")
    }
    CV.TR.est <- sqrt(exp(aovIII["Residuals", "Mean Sq"])-1)
    info <- paste0("\nIncomplete blocks extracted from ",
                   length(seqs), "\u00D7", treatments)
    if (williams) {
      info <- paste(info, "Williams\u2019 design\n")
    } else {
      info <- paste(info, "Latin Squares design\n")
    }
    if (details) {
      cat(info); print(aovIII, digits = 4, signif.legend = FALSE)
    }
    txt <- paste0("PE ", comp.trt[1], "/", comp.trt[2],
                  "         ",
                  sprintf("%6.4f", TR.est),"\n",
                  sprintf("%.f%% %s", 100*(1-2*alpha),
                  "CI         "),
                   paste(sprintf("%6.4f", TR.CI),
                         collapse = " \u2013 "))
    if (round(TR.CI[1], 4) >= theta1 &
        round(TR.CI[2], 4) <= theta2) {
      txt <- paste(txt, "(pass)")
    } else {
      txt <- paste(txt, "(fail)")
    }
    txt <- paste(txt, "\nCV (intra)    ",
                 sprintf("%6.4f", CV.TR.est))
    if (power) {
      post.hoc <- suppressMessages(
                    power.TOST(CV = CV.TR.est,
                               theta0 = TR.est,
                               n = n))
      txt <- paste(txt, "\nPost hoc power",
                   sprintf("%6.4f", post.hoc))
    }
    cat(txt, "\n")
  }
}
sim.HOXover(treatments = 4, sequences = 4,
            T = c("A", "B", "C"), R = "D",
            theta0 = c(0.93, 0.97, 0.99),
            CV = c(0.20, 0.22, 0.25), n = 16,
            nested = FALSE, details = FALSE,
            show.data = FALSE, power = TRUE,
            setseed = TRUE)
R> PE A/D         0.9395
R> 90% CI         0.8407 – 1.0499 (pass) 
R> CV (intra)     0.1789 
R> Post hoc power 0.7813 
R> PE B/D         0.9296
R> 90% CI         0.8200 – 1.0539 (pass) 
R> CV (intra)     0.2011 
R> Post hoc power 0.6402 
R> PE C/D         0.9164
R> 90% CI         0.7957 – 1.0555 (fail) 
R> CV (intra)     0.2271 
R> Post hoc power 0.4751

Based on the PEs the most promising formulation is candidate \(\small{A}\). Based on post hoc power we would come to the same conclusion. However, this is not always the case. Run the script with the argument setseed = FALSE for other simulations (up to 5 treatments in Williams’ designs and Latin Squares are supported).
I would always base the decision on the PE and CV because power depends on both and hence, is less informative.


Conclusion

Coming back to the question asked in the introduction. To repeat:

    What is the purpose of calculating post hoc power?

    
  • Power is not relevant in the regulatory decision process. The patient’s risk α is independent of the producer’s risk β. Regulators should be concerned only about the former.

    • High power does not further support an already passing study.
    • Low power does not refute its outcome.
    
  • If you are asked by a regulatory assessor for a ‘justification’, answer in a diplomatic way.


License

CC BY 4.0 Helmut Schütz 2021
1st version April 28, 2021.
Rendered 2021-05-12 17:51:57 CEST by rmarkdown in 0.48 seconds.

Footnotes and References


  1. Labes D, Schütz H, Lang B. PowerTOST: Power and Sample Size for (Bio)Equivalence Studies. 2021-01-18. CRAN.↩︎

  2. Labes D. randomizeBE: Create a Random List for Crossover Studies. 2019-08-24. CRAN.↩︎

  3. Labes D, Schütz H, Lang B. Package ‘PowerTOST’. January 18, 2021. CRAN.↩︎

  4. Zehetmayer S, Posch M. Post-hoc power estimation in large-scale multiple testing problems. Bioinformatics. 2010; 26(8): 1050–6. doi:10.1093/bioinformatics/btq085.↩︎

  5. Hoenig JM, Heisey DM. The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. Am. Stat. 2001; 55(1): 19–24. doi:10.1198/000313001300339897.↩︎

  6. Lenth RV. Some Practical Guidelines for Effective Sample Size Determination. Am. Stat. 2001; 55(3): 187–93. doi:10.1198/000313001317098149.↩︎

  7. Senn S. Power is indeed irrelevant in interpreting completed studies. BMJ. 2002; 325(7375): 1304. PMID 12458264. PMC Free Full Text.↩︎

  8. EMEA. Committee for Proprietary Medicinal Products. London. 26 July 2001. Note for Guidance on the Investigation of Bioavailability and Bioequivalence. Section 3.1 Design. CPMP/EWP/QWP/1401/98.↩︎

  9. EMA. Committee for Medicinal Products for Human Use. London. 20 January 2010. Guideline on the Investigation of Bioequivalence. Section 4.1.3 Number of subjects. CPMP/EWP/QWP/1401/98 Rev. 1/ Corr **.↩︎

  10. Schuirmann DJ. A comparison of the Two One-Sided Tests Procedure and the Power Approach for Assessing the Equivalence of Average Bioavailability. J Pharmacokin Biopharm. 1987; 15(6): 657–80. doi:10.1007/BF01068419.↩︎

  11. WHO. Geneva. November 2020. Frequent deficiencies in BE study protocols.↩︎

  12. BEBA-Forum. 20 July 2014. Revalidate the study.↩︎

  13. Potvin D, DiLiberti CE, Hauck WW, Parr AF, Schuirmann DJ, Smith RA. Sequential design approaches for bioequivalence studies with crossover designs. Pharm Stat. 2008; 7(4): 245–62. doi:10.1002/pst.294.↩︎

  14. Berger RL, Hsu JC. Bioequivalence Trials, Intersection-Union Tests and Equivalence Confidence Sets. Stat Sci. 1996; 11(4): 283–302. JSTOR:2246021.↩︎

  15. Zeng A. The TOST confidence intervals and the coverage probabilities with R simulation. March 14, 2014.↩︎

  16. It will be the same if the PE is exactly 1.↩︎

  17. I bombarded Pharsight for years asking for an update. I was a beta-tester of PHX v6.0–6.4. Guess who suggested which software to compare to.↩︎

  18. Approximation by the noncentral t-distribution.↩︎

  19. I don’t know when the 80/20-rule was introduced (mid-1980s?). Schuirmann showed in 1987 why it is crap.↩︎