Consider allowing JavaScript. Otherwise, you have to be proficient in reading since formulas will not be rendered. Furthermore, the table of contents in the left column for navigation will not be available and code-folding not supported. Sorry for the inconvenience.

Examples in this article were generated with
4.2.1 by the packages ``PowerTOST`

^{1} and
`randomizeBE`

.^{2} See also the README on GitHub
for an overview and the online manual^{3} for details.

- The right-hand badges give the respective section’s ‘level’.

- Basics about power and sample size methodology – requiring no or only limited statistical expertise.

- These sections are the most important ones. They are – hopefully – easily comprehensible even for novices.

- A somewhat higher knowledge of statistics and/or R is required. May be skipped or reserved for a later reading.

- An advanced knowledge of statistics and/or R is required. Not recommended for beginners in particular.

- Click to show / hide R code.

What is the purpose of calculating
*post hoc* Power?

Sometimes regulatory assessors ask for the
‘justification’ of ‘low’ or ‘high’ power in an equivalence
trial.

I will try to clarify why such a justification is meaningless and – a bit provocative – asking for one demonstrates a lack of understanding the underlying statistical concepts (see another article for details about inferential statistics and hypotheses in equivalence).

As an aside, in bioinformatics – where *thousands* of test are
performed simultaneously – *post hoc* power is indeed a suitable
tool to select promising hypotheses.^{4}

In the case of a single test (like in bioequivalence) it has been
extensively criticized.^{5} ^{6} ^{7}

It *may* be of interest for the sponsor if another study is
planned.

A *basic* knowledge of R is
required. To run the scripts at least version 1.4.3 (2016-11-01) of
`PowerTOST`

is suggested. Any version of R would likely do, though the current release of
`PowerTOST`

was only tested with version 4.1.3 (2022-03-10)
and later. At least version 0.3.1 (2012-12-27) of
`randomizeBE`

is required.

All scripts were run on a Xeon E3-1245v3 @ 3.40GHz (1/4 cores) 16GB RAM
with R 4.2.1 on Windows 7 build 7601, Service
Pack 1, Universal C Runtime 10.0.10240.16390.

```
library(PowerTOST) # attach required ones
library(randomizeBE) # to run the examples
```

“There is simple intuition behind results like these: If my car made it to the top of the hill, then it is powerful enough to climb that hill; if it didn’t, then it obviously isn’t powerful enough. Retrospective power is an obvious answer to a rather uninteresting question. A more meaningful question is to ask whether the car is powerful enough to climb a particular hill never climbed before; or whether a different car can climb that new hill. Such questions are prospective, not retrospective.

In order to get prospective (*a priori*) power and hence,
estimate a sample size, we need five values:

- The level of the test (in bioequivalence commonly \(\small{\alpha=0.05}\)),
- the BE-margins (commonly \(\small{80-125\%}\)),
- the desired (or target) power \(\small{\pi}\), where the producer’s risk of failure \(\small{\beta=1-\pi}\),
- the variance \(\small{s^2}\) (commonly expressed as a coefficient of variation \(\small{CV}\)), and
- the deviation \(\small{\Delta}\) of the test from the reference treatment.

1 & 2 are fixed by the
agency,

3 is set by the sponsor
(commonly to \(\small{80-90\%}\)),
and

4 & 5 are – uncertain! – assumptions.

The European Note for Guidance of 2001^{9} was specific in this
respect:

“The number of subjects required is determined byUnfortunately the current guideline

- the error variance associated with the primary characteristic to be studied as estimated from a pilot experiment, from previous studies or from published data,
- the significance level desired,
- the expected deviation from the reference product compatible with bioequivalence (delta) and
- the required power.

»*The number of subjects to be included in the study should be based
on an appropriate sample size calculation.*«

May I ask what’s *appropriate*?

However, all of that deals with the design of the study. From a regulatory perspective the
outcome of a comparative
bioavailability study is dichotomous:

Either bioequivalence was demonstrated (the 90%
confidence interval lies entirely within the pre-specified acceptance
range) or not.

- The patient’s risk is always strictly controlled (
*α*≤ 0.05).

*Post hoc*(a.k.a.*a posteriori*, retrospective) power and the producer’s risk*β*do not play*any*role in this decision,*i.e.*, an agency should not even ask for it because*β*is independent from*α*.

Realization: Observations (in a sample) of a random variable (of the
population).

Since it is extremely unlikely that all assumptions made in designing
the study will be *exactly* realized, it is of limited value to
assess *post hoc* power – unless the study failed and we want to
design another one.

Let’s explore the properties of prospective power in an example.
Assumed *CV* 0.25, T/R-ratio 0.95, targeted at 90% power in a
2×2×2 crossover design:

```
<- 0.05
alpha <- 0.25
CV <- 0.95
theta0 <- 0.80
theta1 <- 1/theta1
theta2 <- 0.90
target <- "2x2x2"
design <- sampleN.TOST(alpha = alpha, CV = CV,
n theta0 = theta0,
theta1 = theta1, theta2 = theta2,
targetpower = target,
design = design,
print = FALSE)[["Sample size"]]
<- sort(unique(c(seq(theta1, 1, length.out = 100),
pe 1/seq(theta1, 1, length.out = 100))))
<- data.frame(pe = pe, p.left = NA, p.right = NA,
res1 power = NA)
for (i in 1:nrow(res1)) {
2:3] <- pvalues.TOST(pe = pe[i],
res1[i, theta1 = theta1,
theta2 = theta2,
CV = CV, n = n,
design = design)
4] <- power.TOST(alpha = alpha,
res1[i, theta0 = pe[i],
theta1 = theta1,
theta2 = theta2,
CV = CV, n = n,
design = design)
}<- c("#FF0000A0", "#FF00FFA0", "#0000FFA0")
clr dev.new(width = 4.5, height = 4.5)
<- par(no.readonly = TRUE)
op par(mar = c(4, 4, 0, 0))
plot(pe, res1$power, type = "n", ylim = c(0, 1), log = "x",
axes = FALSE, xlab = "", ylab = "")
abline(v = c(theta0, 1/theta0),
h = c(alpha, 0.5), lty = 2)
grid()
box()
axis(1, at = c(seq(theta1, 1.2, 0.1), theta2))
axis(1, at = c(theta1, theta0, 1/theta0, theta2),
tick = FALSE, line = 2,
labels = c(expression(theta[1]),
expression(theta[0]),
expression(1/theta[0]),
expression(theta[2])))
axis(2, las = 1)
mtext(expression(italic(p)* " / power"), 2, line = 3)
lines(pe, res1$p.left, col = clr[1], lwd = 3)
lines(pe, res1$p.right, col = clr[2], lwd = 3)
lines(pe, res1$power, col = clr[3], lwd = 3)
legend("center", box.lty = 0, bg = "white", y.intersp = 1.2,
legend = c(expression(italic(p)>=theta[1]),
expression(italic(p)<=theta[2]),
"power"), col = clr, lwd = 3, seg.len = 3)
par(op)
```

With the assumed \(\small{\theta_0}\) of 0.9500 (and its
reciprocal of 1.0526) we achieve the target power. If \(\small{\theta_0}\) will be closer to 1, we
gain power. With \(\small{\theta_0}\)
at the BE-margins \(\small{\left\{\theta_1,\theta_2\right\}}\)
it will lead to a false decision (probability \(\small{\alpha=0.05}\) of the Type I Error).
Note that at the margins the respective *p*-value of the
TOST procedure^{11} is 0.5 (like tossing
a coin).

We can reverse this game and ask at which T/R-ratios power will be just 0.5 (with more deviating ones the chance to demonstrate BE is futile).

```
<- function(x) {
opt power.TOST(theta0 = x, alpha = alpha,
theta1 = theta1, theta2 = theta2,
design = design,
CV = CV, n = n) - 0.5
}<- 0.05
alpha <- 0.25
CV <- 0.95
theta0 <- 0.80
theta1 <- 1/theta1
theta2 <- 0.90
target <- "2x2x2"
design <- sampleN.TOST(alpha = alpha, CV = CV,
n theta0 = theta0,
theta1 = theta1, theta2 = theta2,
targetpower = target,
design = design,
print = FALSE)[["Sample size"]]
<- sort(unique(c(seq(theta1, 1, length.out = 100),
pe 1/seq(theta1, 1, length.out = 100))))
<- data.frame(pe = pe, p.left = NA, p.right = NA,
res1 power = NA)
for (i in 1:nrow(res1)) {
2:3] <- pvalues.TOST(pe = pe[i],
res1[i, theta1 = theta1,
theta2 = theta2,
CV = CV, n = n,
design = design)
4] <- power.TOST(alpha = alpha,
res1[i, theta0 = pe[i],
theta1 = theta1,
theta2 = theta2,
CV = CV, n = n,
design = design)
}.5 <- uniroot(opt, interval = c(theta1, theta0), tol = 1e-12)$root
pwr0<- c("#FF0000A0", "#FF00FFA0", "#0000FFA0")
clr dev.new(width = 4.5, height = 4.5)
<- par(no.readonly = TRUE)
op par(mar = c(4, 4, 0, 0))
plot(pe, res1$power, type = "n", ylim = c(0, 1), log = "x",
axes = FALSE, xlab = "", ylab = "")
abline(v = c(pwr0.5, 1/pwr0.5),
h = c(alpha, 0.5), lty = 2)
grid()
box()
axis(1, at = c(seq(theta1, 1.2, 0.1), theta2))
axis(1, at = c(theta1, pwr0.5, 1/pwr0.5, theta2),
tick = FALSE, line = 2,
labels = c(expression(theta[1]),
expression(theta[0]),
expression(1/theta[0]),
expression(theta[2])))
axis(2, las = 1)
mtext(expression(italic(p)* " / power"), 2, line = 3)
lines(pe, res1$p.left, col = clr[1], lwd = 3)
lines(pe, res1$p.right, col = clr[2], lwd = 3)
lines(pe, res1$power, col = clr[3], lwd = 3)
legend("center", box.lty = 0, bg = "white", y.intersp = 1.2,
legend = c(expression(italic(p)>=theta[1]),
expression(italic(p)<=theta[2]),
"power"), col = clr, lwd = 3, seg.len = 3)
par(op)
```

We see that the extreme T/R-ratios are 0.8795 and 1.1371. At these
T/R-ratios the respective *p*-value of the
TOST is 0.05 and the null
hypothesis of inequivalence has to be accepted.

As stated already above, *post hoc* power is not relevant in
the assessment for BE. Consequently,
it is *not* mentioned in *any* (‼) of the current global
guidelines. Recently the
WHO argued against it.^{12}

“Thea posterioripower of the study does not need to be calculated. The power of interest is that calculated before the study is conducted to ensure that the adequate sample size has been selected. […] The relevant power is the power to show equivalence within the pre-defined acceptance range.

top of section ↩︎ previous section ↩︎

Again: **Either** a study passed **or** it
failed. If any of the assumptions in sample size planning (*CV*,
T/R-ratio, anticipated dropout-rate) are not *exactly* realized
in the study *and* turned out to be worse (*e.g.*, higher
*CV*, T/R-ratio deviating more from 1, less eligible subjects),
*post hoc* power will be lower than targeted. Or simply:

“Being lucky is not a crime.

Since \(\small{\alpha}\) and \(\small{\beta}\) are independent, this is
*not* of a regulatory concern (the patient’s risk is not
affected).

“*Congratulations! However, since this achievement was improbable, we
don’t pay you anything.*”

From the results of Example 2, Method B (Walter Hauck and Donald Schuirmann are pioneers of statistics in bioequivalence):

“[…] we obtain a two-sided CI for the ratio (T/R) of geometric means of 88.45—116.38%, which meets the 80–125% acceptance criterion. We stop here and conclude BE, irrespective of the fact that we have not yet achieved the desired power of 80% (power = 66.3%).

Let’s consider an example. Assumed *CV* 0.30, T/R-ratio 0.95,
targeted at 80% power in a 2×2×2 crossover design:

```
<- 0.05
alpha <- 0.30
CV <- 0.95
theta0 <- 0.80
theta1 <- 1/theta1
theta2 <- 0.80
target <- "2x2x2"
design <- data.frame(CV = CV, theta0 = theta0,
plan n = NA, power = NA)
3:4] <- sampleN.TOST(alpha = alpha, CV = CV,
plan[theta0 = theta0,
theta1 = theta1, theta2 = theta2,
targetpower = target,
design = design,
print = FALSE)[7:8]
print(signif(plan, 4), row.names = FALSE)
```

```
# CV theta0 n power
# 0.3 0.95 40 0.8158
```

So far, so good.

However, in the study the *CV* was higher (0.36), the Point
Estimate worse (0.92), and we had two dropouts (38 eligible
subjects).

```
<- 0.05
alpha <- 0.80
theta1 <- 1/theta1
theta2 <- "2x2x2"
design <- 0.36 # higher than assumed
CV <- 0.92 # more deviating than assumed
pe <- 38 # 2 dropouts
n <- data.frame(CV = CV, PE = pe, n = n,
actual lower.CL = NA, upper.CL = NA,
BE = "fail", power = NA, TIE = NA)
4:5] <- sprintf("%.4f", CI.BE(alpha = alpha, CV = CV,
actual[pe = pe, n = n, design = design))
if (actual[4] >= theta1 & actual[5] <= theta2) actual$BE <- "pass"
7] <- sprintf("%.4f", power.TOST(alpha = alpha, CV = CV,
actual[theta0 = pe, n = n,
design = design))
8] <- sprintf("%.6f", power.TOST(alpha = alpha, CV = CV,
actual[theta0 = theta1, n = n,
design = design))
print(actual, row.names = FALSE)
```

```
# CV PE n lower.CL upper.CL BE power TIE
# 0.36 0.92 38 0.8036 1.0532 pass 0.5094 0.049933
```

The study passed, though it was a close shave. What does the *post
hoc* power of 0.5094 tell us what we don’t already know? Nothing. Is
the patient’s risk compromised? Of course, not (0.049933 < 0.05).

This is a term sometimes used by regulatory assessors.

Stop searching: You will not find it in any publication, textbook, or
guideline.

The idea behind: The sponsor had such a large budget, that the study was intentionally ‘overpowered’. What does that mean? Generally studies are powered to 80 – 90% (producer’s risk \(\small{\beta}\) 10 – 20%). If, say, the sponsor submitted a protocol aiming at 99% power, it should have been rejected by the IEC / IRB. Keep in mind that it’s the job of the IEC / IRB to be concerned about the safety of subjects in the study. If members voted in favor of performing the study, all is good. Regrettably, their statistical expertise is limited or it might have just slipped through their attention…

However, it might well be the opposite of what we discussed above. Say, the study was properly planned
(prospective power ≥ 80%), but the realized *CV* lower, the
PE closer to 1, and the dropout-rate
lower than the anticipated 10%.

```
<- function(n) {
up2even return(as.integer(2 * (n %/% 2 + as.logical(n %% 2))))
}<- function(n, do.rate) {
nadj return(as.integer(up2even(n / (1 - do.rate))))
}<- 0.10 # 10%
do.rate <- data.frame(status = "plan", CV = 0.29, theta0 = 0.95,
plan target = 0.80, n = NA, power = NA)
5:6] <- sampleN.TOST(CV = plan$CV, theta0 = plan$theta0,
plan[targetpower = plan$target, print = FALSE,
details = FALSE)[7:8]
6] <- sprintf("%.4f", plan[6])
plan[$dosed <- nadj(plan$n, do.rate)
plan<- data.frame(status = "actual", CV = 0.24, PE = 0.99, n = 43,
actual lower.CL = NA, upper.CL = NA, BE = "fail",
power = NA)
5:6] <- sprintf("%.4f", CI.BE(CV = actual$CV,
actual[pe = actual$PE,
n = actual$n))
if (actual[4] >= 0.80 & actual[5] <= 1.25)
$BE <- "pass"
actual8] <- sprintf("%.4f", suppressMessages(
actual[power.TOST(CV = actual$CV,
theta0 = actual$PE,
n = actual$n)))
print(plan, row.names = FALSE)
cat("\n")
print(actual, row.names = FALSE)
```

```
# status CV theta0 target n power dosed
# plan 0.29 0.95 0.8 38 0.8202 44
#
# status CV PE n lower.CL upper.CL BE power
# actual 0.24 0.99 43 0.9085 1.0788 pass 0.9908
```

Natually, *post hoc* power was substantally higher than
targeted. Again, it is *not* of a regulatory concern (the
patient’s risk is not affected).

Don’t believe it? Calculate the Type I Error (*i.e.*, the
probability that the Null hypothesis of
bio** in**equivalence is true) by setting \(\small{\theta_0=\theta_1}\) or \(\small{\theta_0=\theta_2}\).

```
<- data.frame(CV = actual$CV, n = actual$n,
tie theta0 = c(0.80, 1.25), p = NA_real_)
$p[1] <- suppressMessages(
tiepower.TOST(CV = actual$CV, theta0 = 0.80,
n = actual$n))
$p[2] <- suppressMessages(
tiepower.TOST(CV = actual$CV, theta0 = 1.25,
n = actual$n))
names(tie)[4] <- "p (Null)"
print(tie, digits = 12, row.names = FALSE)
```

```
# CV n theta0 p (Null)
# 0.24 43 0.80 0.0499999999981
# 0.24 43 1.25 0.0499999999981
```

In BE the pharmacokinetic metrics
\(\small{C_\textrm{max}}\) and \(\small{AUC_{0-\textrm{t}}}\) are mandatory
(in some jurisdictions like the
FDA additionally
\(\small{AUC_{0-\infty}}\)). We don’t
have to worry about multiplicity issues (inflated Type I Error,
increased patient’s risk), since if *all* tests must pass at
level \(\small{\alpha}\), we are
protected by the intersection-union principle.^{15} ^{16}

We design studies always for the worst case combination,
*i.e.*, based on the PK
metric requiring the largest sample size.

```
<- c("Cmax", "AUCt", "AUCinf")
metrics <- 0.05
alpha <- c(0.25, 0.19, 0.20)
CV <- c(0.95, 0.95, 0.95)
theta0 <- 0.80
theta1 <- 1 / theta1
theta2 <- 0.80
target <- "2x2x2"
design <- data.frame(metric = metrics, CV = CV,
plan theta0 = theta0, n = NA, power = NA)
for (i in 1:nrow(plan)) {
4:5] <- signif(
plan[i, sampleN.TOST(alpha = alpha,
CV = CV[i],
theta0 = theta0[i],
theta1 = theta1,
theta2 = theta2,
targetpower = target,
design = design,
print = FALSE)[7:8], 4)
}<- paste0("Sample size based on ",
txt $metric[plan$n == max(plan$n)], ".\n")
planprint(plan, row.names = FALSE)
cat(txt)
```

```
# metric CV theta0 n power
# Cmax 0.25 0.95 28 0.8074
# AUCt 0.19 0.95 18 0.8294
# AUCinf 0.20 0.95 20 0.8347
# Sample size based on Cmax.
```

Say, we dosed more subjects to adjust for the anticipated dropout-rate of 10%. Let’s assume that all of our assumptions were exactly realized for \(\small{C_\textrm{max}}\) and were slightly ‘better’ for the \(\small{AUC \textrm{s}}\). There were two dropouts.

```
<- function(n) {
up2even # round up to the next even number to get balanced sequences
return(as.integer(2 * (n %/% 2 + as.logical(n %% 2))))
}<- function(n, do.rate) {
nadj return(as.integer(up2even(n / (1 - do.rate))))
}<- c("Cmax", "AUCt", "AUCinf")
metrics <- 0.05
alpha <- c(0.25, 0.17, 0.18)
CV <- c(0.95, 0.97, 0.96)
pe <- 0.1 # anticipated droput-rate 10%
do.rate <- nadj(plan$n[plan$n == max(plan$n)], do.rate)
n.adj <- n.adj - 2 # two dropouts
n.elig <- 0.80
theta1 <- 1 / theta1
theta2 <- data.frame(metric = metrics,
actual CV = CV, PE = pe, n = n.elig,
lower.CL = NA, upper.CL = NA,
BE = "fail", power = NA, TIE = NA)
for (i in 1:nrow(actual)) {
5:6] <- sprintf("%.4f", CI.BE(alpha = alpha,
actual[i, CV = CV[i],
pe = pe[i],
n = n.elig,
design = design))
if (actual[i, 5] >= theta1 & actual[i, 6] <= theta2)
$BE[i] <- "pass"
actual8] <- sprintf("%.4f", power.TOST(alpha = alpha,
actual[i, CV = CV[i],
theta0 = pe[i],
n = n.elig,
design = design))
9] <- sprintf("%.9f", power.TOST(alpha = alpha,
actual[i, CV = CV[i],
theta0 = theta2,
n = n.elig,
design = design))
}print(actual, row.names = FALSE)
```

```
# metric CV PE n lower.CL upper.CL BE power TIE
# Cmax 0.25 0.95 30 0.8526 1.0585 pass 0.8343 0.049999898
# AUCt 0.17 0.97 30 0.9007 1.0446 pass 0.9961 0.050000000
# AUCinf 0.18 0.96 30 0.8876 1.0383 pass 0.9865 0.050000000
```

Such results are common in BE. As always, the patient’s risk is ≤ 0.05.

I have seen deficiency letters by regulatory assessors asking for a
»*justification of too high power for AUC*«.

Outright bizarre.

Another case is common in jurisdictions accepting reference-scaling
only for \(\small{C_\textrm{max}}\)
(*e.g.* by
ABEL).
Then quite often the sample size depends on \(\small{AUC}\).

```
<- c("Cmax", "AUCt", "AUCinf")
metrics <- 0.05
alpha <- c(0.45, 0.34, 0.36)
CV <- rep(0.90, 3)
theta0 <- 0.80
theta1 <- 1 / theta1
theta2 <- 0.80
target <- "2x2x4"
design <- data.frame(metric = metrics,
plan method = c("ABEL", "ABE", "ABE"),
CV = CV, theta0 = theta0,
L = 100*theta1, U = 100*theta2,
n = NA, power = NA)
for (i in 1:nrow(plan)) {
if (plan$method[i] == "ABEL") {
5:6] <- round(100*scABEL(CV = CV[i]), 2)
plan[i, 7:8] <- signif(
plan[i, sampleN.scABEL(alpha = alpha,
CV = CV[i],
theta0 = theta0[i],
theta1 = theta1,
theta2 = theta2,
targetpower = target,
design = design,
details = FALSE,
print = FALSE)[8:9], 4)
else {
}7:8] <- signif(
plan[i, sampleN.TOST(alpha = alpha,
CV = CV[i],
theta0 = theta0[i],
theta1 = theta1,
theta2 = theta2,
targetpower = target,
design = design,
print = FALSE)[7:8], 4)
}
}<- paste0("Sample size based on ",
txt $metric[plan$n == max(plan$n)], ".\n")
planprint(plan, row.names = FALSE)
cat(txt)
```

```
# metric method CV theta0 L U n power
# Cmax ABEL 0.45 0.9 72.15 138.59 28 0.8112
# AUCt ABE 0.34 0.9 80.00 125.00 50 0.8055
# AUCinf ABE 0.36 0.9 80.00 125.00 56 0.8077
# Sample size based on AUCinf.
```

If the study is performed with 56 subjects and all assumed values are
realized, *post hoc* power will be 0.9666 for \(\small{C_\textrm{max}}\).

Power of the CI inclusion
approach (or power of the TOST)
has nothing to do with power of testing for a statistically significant
difference of 20%. In the latter the
PE is essentially ignored and power
depends only on the *CV* and the sample size. Consequently, power
will be practically^{17} always higher than the correct one.

```
<- 0.05
alpha <- 0.25
CV <- 0.95
theta0 <- 0.80
theta1 <- 1/theta1
theta2 <- 0.80
target <- "2x2x2"
design <- sampleN.TOST(alpha = alpha, CV = CV,
n theta0 = theta0,
theta1 = theta1, theta2 = theta2,
targetpower = target, design = design,
print = FALSE)[["Sample size"]]
<- 0.89
pe <- data.frame(PE = pe, lower = NA, upper = NA,
res p.left = NA, p.right = NA,
power.TOST = NA, power.20 = NA)
2:3] <- CI.BE(alpha = alpha, pe = pe,
res[CV = CV, n = n, design = design)
4:5] <- pvalues.TOST(pe = pe, CV = CV, n = n,
res[design = design)
6] <- power.TOST(alpha = alpha, theta0 = pe,
res[theta1 = theta1, theta2 = theta2,
CV = CV, n = n, design = design)
7] <- power.TOST(alpha = alpha, theta0 = 1,
res[theta1 = theta1, theta2 = theta2,
CV = CV, n = n, design = design)
names(res)[7] <- "power (80/20)"
print(signif(res, 4), row.names = FALSE)
```

```
# PE lower upper p.left p.right power.TOST power (80/20)
# 0.89 0.7955 0.9957 0.05864 1.097e-05 0.4729 0.9023
```

In this example the study failed (although properly planned for an
assumed T/R-ratio of 0.95) because the
PE was only 89% and hence, the lower
90% confidence limit just 79.55%. If you prefer
TOST, the study failed because
\(\small{p(PE\geq\theta_1)}\) was with
0.05864 > 0.05 and the Null hypothesis of inequivalence could not be
rejected. Like in any failed study, *post hoc* power was
< 50%.

A power of 90.23% is not even wrong.
If it really would have been that high, why on earth did the study fail?
Amazingly, such false results are all too often seen in study
reports.

`Power_TOST`

^{18} was introduced in Phoenix WinNonlin^{19} v6.4 in
2014. During development^{20} it was cross-validated against the R package `PowerTOST`

, function
`power.TOST(method = "nct")`

.^{21}

`Power_80_20`

for the
FDA’s power
approach of 1972 (‼)^{22} was abandonded after Schuirmann^{11} showed why it is crap.^{23} It is still
available^{24} for ‘backwards compatibility’. If you
were guilty of using it in the past, revise your procedures. Or even
better: Stop reporting *post hoc* power at
all.

*Post hoc* power is an information sink.

Say, we test in a pilot study three candidate formulations \(\small{A,B,C}\) against reference \(\small{D}\). We would select the candidate
with \(\small{\min\left\{\left|\log_{e}\theta_\textrm{A}\right|,\left|\log_{e}\theta_\textrm{B}\right|,\left|\log_{e}\theta_\textrm{C}\right|\right\}}\)
for the pivotal study because it is the most promising one. If two
T/R-ratios (or their reciprocals) are similar, we would select the
candidate with the lower *CV*.

Let’s simulate a study in a 4×4 Williams’ design and evalute it according to the ‘Two at a Time’ approach (for details see the article about Higher-Order Crossover Designs). Cave: 257 LOC.

```
<- function(treatments = 3,
make.structure sequences = 6,
williams = TRUE,
seqs, n, setseed = TRUE) {
# seq: vector of sequences (optional)
# n : number of subjects
if (treatments < 3 | treatments > 5)
stop("Only 3, 4, or 5 treatments implemented.")
if (williams) {
if (!sequences %in% c(4, 6, 10))
stop("Only 4, 6, or 10 sequences in Williams\u2019 ",
"designs implemented.")
if (treatments == 3 & !sequences == 6)
stop("6 sequences required for 3 treatments.")
if (treatments == 4 & !sequences == 4)
stop("4 sequences required for 4 treatments.")
if (treatments == 5 & !sequences == 10)
stop("10 sequences required for 5 treatments.")
else {
}if (!treatments == sequences)
stop("Number of sequences have to equal ",
"\nthe number of treatments in a Latin Square")
}if (n %% sequences != 0) {
<- paste0("\n", n, " is not a multiple of ",
msg ".",
sequences, "\nOnly balanced sequences implemented.")
stop(msg)
}# defaults if seqs not given
if (is.null(seqs)) {
if (williams) {
if (setseed) {
<- 1234567
seed else {
}<- runif(1, max = 1e7)
seed
}<- RL4(nsubj = n, seqs = williams(treatments),
tmp blocksize = sequences, seed = seed,
randctrl = FALSE)$rl[, c(1, 3)]
<- data.frame(subject = rep(1:n, each = treatments),
subj.per period = 1:treatments)
<- merge(subj.per, tmp, by.x = c("subject"),
data by.y = c("subject"), sort = FALSE)
else {
}if (treatments == 3) {
<- c("ABC", "BCA", "CAB")
seqs
}if (treatments == 4) {
<- c("ABCD", "BCDA", "CDAB", "DABC")
seqs
}if (treatments == 5) {
<- c("ABCDE", "BCDEA", "CDEAB", "DEABC", "EABCD")
seqs
}
}else {
}if (!length(seqs) == sequences)
stop("number of elements of 'seqs' have to equal 'sequences'.")
if (!typeof(seqs) == "character")
stop("'seqs' has to be a character vector.")
<- rep(1:n, each = treatments)
subject <- rep(seqs, each = n / length(seqs) * treatments)
sequence <- data.frame(subject = subject,
data period = 1:treatments,
sequence = sequence)
}for (i in 1:nrow(data)) {
$treatment[i] <- strsplit(data$sequence[i],
data"")[[1]][data$period[i]]
}return(data)
}
<- function(alpha = 0.05, treatments = 3, sequences = 6,
sim.HOXover williams = TRUE, T, R, theta0,
theta1 = 0.80, theta2 = 1.25,
seqs = NULL, nested = FALSE,
CV, n, show.data = FALSE, details = FALSE,
power = FALSE, setseed = TRUE) {
# simulate Higher-Order Crossover design (Williams' or Latin Squares)
# T: vector of T-codes (character, e.g., c("A", "C")
# R: vector of R-codes (character, e.g., c("B", "D")
# theta0: vector of T/R-ratios
# CV: vector of CVs of the T/R-comparisons
if (treatments < 3)
stop("At least 3 treatments required.")
if (treatments > 5)
stop("> 5 treatments not implemented.")
<- paste0(T, R)
comp <- length(comp)
comparisons if (treatments == 3 & comparisons > 2)
stop("Only 2 comparisons for 3 treatments implemented.")
if (!length(theta0) == comparisons)
stop("theta0 has to be a vector with ", comparisons,
" elements.")
if (!length(CV) == comparisons)
stop("CV has to be a vector with ", comparisons,
" elements.")
# vector of ordered treatment codes
<- c(T, R)[order(c(T, R))]
trts <- c(paste("Type III ANOVA:",
heading.aovIII "Two at a Time approach"),
"sequence vs")
# data.frame with columns subject, period, sequence, treatment
<- make.structure(treatments = treatments,
data sequences = sequences,
seq = NULL, n = n,
williams = williams, setseed)
if (setseed) set.seed(1234567)
<- CV2se(CV)
sd <- data.frame(subject = 1:n,
Tsim treatment = rep(T, each = n),
PK = NA)
<- seq(n, nrow(Tsim), n)
cuts <- 1
j for (i in seq_along(cuts)) {
$PK[j:cuts[i]] <- exp(rnorm(n = n,
Tsimmean = log(theta0[i]),
sd = sd[i]))
<- j + n
j
}<- data.frame(subject = 1:n,
Rsim treatment = rep(R, each = n),
PK = NA)
<- seq(n, nrow(Rsim), n)
cuts <- 1
j for (i in seq_along(cuts)) {
$PK[j:cuts[i]] <- exp(rnorm(n = n, mean = 0,
Rsimsd = sd[i]))
<- j + n
j
}<- rbind(Tsim, Rsim)
TRsim <- merge(data, TRsim,
data by.x = c("subject", "treatment"),
by.y = c("subject", "treatment"),
sort = FALSE)
<- c("subject", "period",
cols "sequence", "treatment")
<- lapply(data[cols], factor)
data[cols] <- unique(data$sequence)
seqs if (show.data) print(data, row.names = FALSE)
# named vectors of geometric means
<- setNames(rep(NA, treatments), 1:treatments)
per.mean <- setNames(rep(NA, treatments), trts)
trt.mean <- setNames(rep(NA, length(seqs)), seqs)
seq.mean for (i in 1:treatments) {
<- exp(mean(log(data$PK[data$period == i])))
per.mean[i] <- exp(mean(log(data$PK[data$treatment == trts[i]])))
trt.mean[i]
}for (i in 1:sequences) {
<- exp(mean(log(data$PK[data$sequence == seqs[i]])))
seq.mean[i]
}<- "Geometric means of PK response\n period"
txt for (i in seq_along(per.mean)) {
<- paste0(txt, "\n ", i,
txt paste(sprintf(": %.4f", per.mean[i])))
}<- paste(txt, "\n sequence")
txt for (i in seq_along(seq.mean)) {
<- paste0(txt, "\n ", seqs[i],
txt paste(sprintf(": %.4f", seq.mean[i])))
}<- paste(txt, "\n treatment")
txt for (i in seq_along(trt.mean)) {
<- paste0(txt, "\n ", trts[i],
txt paste(sprintf(": %.4f", trt.mean[i])))
}if (details) cat(txt, "\n")
for (i in 1:comparisons) {# IBD comparisons
<- strsplit(comp[i], "")[[1]]
comp.trt <- data[data$treatment %in% comp.trt, ]
TaT $treatment <- as.character(TaT$treatment)
TaT$treatment[TaT$treatment == comp.trt[1]] <- "T"
TaT$treatment[TaT$treatment == comp.trt[2]] <- "R"
TaT$treatment <- as.factor(TaT$treatment)
TaTif (nested) {# bogus
<- lm(log(PK) ~ sequence + subject%in%sequence +
model + treatment,
period data = TaT)
else { # simple for unique subject codes
}<- lm(log(PK) ~ sequence + subject +
model + treatment,
period data = TaT)
}<- exp(coef(model)[["treatmentT"]])
TR.est <- as.numeric(exp(confint(model, "treatmentT",
TR.CI level = 1 - 2 * alpha)))
<- toString(model$call)
m.form <- paste0(substr(m.form, 5, nchar(m.form)),
m.form " (", paste(comp.trt, collapse=""), ")")
<- anova(model)
aovIII if (nested) {
<- aovIII["sequence:subject", "Mean Sq"]
MSdenom <- aovIII["sequence:subject", "Df"]
df2 else {
}<- aovIII["subject", "Mean Sq"]
MSdenom <- aovIII["subject", "Df"]
df2
}<- aovIII["sequence", "Mean Sq"] / MSdenom
fvalue <- aovIII["sequence", "Df"]
df1 "sequence", 4] <- fvalue
aovIII["sequence", 5] <- pf(fvalue, df1, df2,
aovIII[lower.tail = FALSE)
attr(aovIII, "heading")[1] <- m.form
attr(aovIII, "heading")[2:3] <- heading.aovIII
if (nested) {
attr(aovIII, "heading")[3] <- paste(heading.aovIII[2],
"sequence:subject")
else {
}attr(aovIII, "heading")[3] <- paste(heading.aovIII[2],
"subject")
}<- sqrt(exp(aovIII["Residuals", "Mean Sq"])-1)
CV.TR.est <- paste0("\nIncomplete blocks extracted from ",
info length(seqs), "\u00D7", treatments)
if (williams) {
<- paste(info, "Williams\u2019 design\n")
info else {
}<- paste(info, "Latin Squares design\n")
info
}if (details) {
cat(info); print(aovIII, digits = 4, signif.legend = FALSE)
}<- paste0("PE ", comp.trt[1], "/", comp.trt[2],
txt " ",
sprintf("%6.4f", TR.est),"\n",
sprintf("%.f%% %s", 100*(1-2*alpha),
"CI "),
paste(sprintf("%6.4f", TR.CI),
collapse = " \u2013 "))
if (round(TR.CI[1], 4) >= theta1 &
round(TR.CI[2], 4) <= theta2) {
<- paste(txt, "(pass)")
txt else {
}<- paste(txt, "(fail)")
txt
}<- paste(txt, "\nCV (intra) ",
txt sprintf("%6.4f", CV.TR.est))
if (power) {
<- suppressMessages(
post.hoc power.TOST(CV = CV.TR.est,
theta0 = TR.est,
n = n))
<- paste(txt, "\nPost hoc power",
txt sprintf("%6.4f", post.hoc))
}cat(txt, "\n")
}
}sim.HOXover(treatments = 4, sequences = 4,
T = c("A", "B", "C"), R = "D",
theta0 = c(0.93, 0.97, 0.99),
CV = c(0.20, 0.22, 0.25), n = 16,
nested = FALSE, details = FALSE,
show.data = FALSE, power = TRUE,
setseed = TRUE)
```

```
# PE A/D 0.9395
# 90% CI 0.8407 – 1.0499 (pass)
# CV (intra) 0.1789
# Post hoc power 0.7813
# PE B/D 0.9296
# 90% CI 0.8200 – 1.0539 (pass)
# CV (intra) 0.2011
# Post hoc power 0.6402
# PE C/D 0.9164
# 90% CI 0.7957 – 1.0555 (fail)
# CV (intra) 0.2271
# Post hoc power 0.4751
```

Based on the Point Estimates the most promising formulation is
candidate \(\small{A}\). Based on
*post hoc* power we would come to the same conclusion. However,
this is not always the case. Run the script with the argument
`setseed = FALSE`

for other simulations (up to five
treatments in Williams’ designs and Latin Squares are supported).

I would always base the decision on the
PE and *CV* because power
depends on both and hence, is less informative.

A bit provocative: Say, we have three studies with *exactly*
the same *post hoc* power.

```
<- function(x) {
opt power.TOST(CV = x, theta0 = pe, n = n) - comp$power[1]
}<- c(0.95, 0.975, 0.90)
theta0 <- c(0.25, NA, NA)
CV <- sampleN.TOST(CV = CV[1], theta0 = theta0[1],
n print = FALSE)[["Sample size"]]
<- data.frame(study = 1:3, theta0 = theta0, CV = CV,
comp lower = NA_real_, upper = NA_real_,
power = NA_real_, BE = "fail",
n = NA_integer_)
for (i in 1:nrow(comp)) {
<- theta0[i]
pe if (i > 1) {
3] <- uniroot(opt, interval = c(0.01, 1),
comp[i, tol = 1e-8)$root
}4:5] <- CI.BE(CV = comp$CV[i], pe = pe, n = n)
comp[i, 6] <- power.TOST(CV = comp[i, 3], theta0 = pe, n = n)
comp[i, if (round(comp[i, 4], 4) >= 0.8 & round(comp[i, 5], 4) <= 1.25)
7] <- "pass"
comp[i, 8] <- sampleN.TOST(CV = comp$CV[i], theta0 = theta0[i],
comp[i, print = FALSE)[["Sample size"]]
}3:6] <- signif(comp[, 3:6], 4)
comp[, <- comp[, c("study", "power", "CV", "theta0",
comp "lower", "upper", "BE", "n")]
names(comp)[4] <- "PE"
print(comp, row.names = FALSE)
```

```
# study power CV PE lower upper BE n
# 1 0.8074 0.2500 0.950 0.8491 1.0630 pass 28
# 2 0.8074 0.2737 0.975 0.8626 1.1020 pass 28
# 3 0.8074 0.1720 0.900 0.8326 0.9728 pass 28
```

Now what? A naïve sample size estimation for the next study will
suggest the same number of subjects… We know that power is more
sensitive to the T/R-ratio than to the *CV*. Should we trust most
in the second study because of the best T/R-ratio despite it showed the
largest *CV*?

Maybe a Bayesian approach helps by taking the uncertainty of estimates into account.

```
<- function(x) {
opt power.TOST(CV = x, theta0 = theta0[i], n = n) - power
}<- c(0.95, 0.975, 0.90)
theta0 <- c(0.25, NA, NA)
CV <- sampleN.TOST(CV = CV[1], theta0 = theta0[1], print = FALSE)
tmp <- tmp[["Sample size"]]
n <- tmp[["Achieved power"]]
power <- data.frame(study = 1:3, theta0 = theta0, CV = CV)
comp for (i in 2:nrow(comp)) {
3] <- uniroot(opt, interval = c(0.01, 1), tol = 1e-8)$root
comp[i,
}<- list(m = n, design = "2x2x2")
priors <- data.frame(study = 1:3, theta0 = theta0, CV = comp$CV,
bayes n1 = NA_integer_, n2 = NA_integer_, n3 = NA_integer_)
for (i in 1:nrow(comp)) {
$n1[i] <- expsampleN.TOST(CV = comp$CV[i], theta0 = theta0[i],
bayesprior.type = c("CV"), prior.parm = priors,
print = FALSE, details = FALSE)[["Sample size"]]
$n2[i] <- expsampleN.TOST(CV = comp$CV[i], theta0 = theta0[i],
bayesprior.type = c("theta0"), prior.parm = priors,
print = FALSE, details = FALSE)[["Sample size"]]
$n3[i] <- expsampleN.TOST(CV = comp$CV[i], theta0 = theta0[i],
bayesprior.type = c("both"), prior.parm = priors,
print = FALSE, details = FALSE)[["Sample size"]]
}print(signif(bayes, 4), row.names = FALSE)
```

```
# study theta0 CV n1 n2 n3
# 1 0.950 0.2500 30 40 44
# 2 0.975 0.2737 32 42 48
# 3 0.900 0.1720 30 38 40
```

`n1`

is the sample size taking the uncertainty of
*CV* into account, `n2`

the one taking the uncertainty
of the T/R-ratio into account, and `n3`

takes both
uncertainties into account. In this case the low *CV* in the
third study outweighs the worst T/R-ratio. That’s cleary more
informative than just looking at *post hoc* power (which was the
same in all studies) and even the T/R-ratio and the *CV*
alone.

Coming back to the question asked in the Introduction. To repeat:

What is the purpose of calculating
*post hoc* Power?

There is none.

*Post hoc*power is not relevant in the regulatory decision process. The patient’s risk*α*is independent from the producer’s risk*β*. Regulators should be concerned only about the former.

- ‘High’
*post hoc*power does not further support an already passing study. - ‘Low’
*post hoc*power does not refute its outcome.

- ‘High’

**To avoid rather fruitless discussions, don’t \(\small{}\tfrac{\textsf{calculate}}{\textsf{report}}\)
post hoc power at all.**

- If you are asked by a regulatory assessor for a ‘justification’ of
low or high
*post hoc*power, answer in a diplomatic way. Keep calm and stay polite.

top of section ↩︎ previous section ↩︎

Licenses

Helmut Schütz 2022

`R`

, `PowerTOST`

, and
`randomizeBE`

GPL 3.0,
`pandoc`

GPL 2.0.

1^{st} version April 28, 2021. Rendered July 26, 2022 09:49 CEST
by rmarkdown
via pandoc in 0.78 seconds.

Footnotes and References

Labes D, Schütz H, Lang B.

*PowerTOST: Power and Sample Size for (Bio)Equivalence Studies.*Package version 1.5.4. 2022-02-21. CRAN.↩︎Labes D.

*randomizeBE: Create a Random List for Crossover Studies.*Package version 0.3.5. 2019-08-24. CRAN.↩︎Labes D, Schütz H, Lang B.

*Package ‘PowerTOST’.*February 21, 2022. CRAN.↩︎Zehetmayer S, Posch M.

*Post-hoc power estimation in large-scale multiple testing problems.*Bioinformatics. 2010; 26(8): 1050–6. 10.1093/bioinformatics/btq085.↩︎Hoenig JM, Heisey DM.

*The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis.*Am Stat. 2010; 55(1): 19–24. doi:10.1198/000313001300339897. Open Access.↩︎Lenth RL.

*Some Practical Guidelines for Effective Sample Size Determination.*Am Stat. 2010; 55(3): 187–93. doi:10.1198/000313001317098149.↩︎Senn S.

*Power is indeed irrelevant in interpreting completed studies.*BMJ. 2002. 325(7375); 1304. PMID 12458264. Free Full Text.↩︎Lenth RV.

*Two Sample-Size Practices that I Don’t Recommend.*October 24, 2000. Online.↩︎EMEA, CPMP.

*Note for Guidance on the Investigation of Bioavailability and Bioequivalence.*CPMP/EWP/QWP/1401/98. Section 3.1 Design. London. 26 July 2001. Online.↩︎EMA, CHMP.

*Guideline on the Investigation of Bioequivalence.*CPMP/EWP/QWP/1401/98 Rev. 1/ Corr **. Section 4.1.3 Number of subjects. London. 20 January 2010. Online.↩︎Schuirmann DJ.

*A comparison of the Two One-Sided Tests Procedure and the Power Approach for Assessing the Equivalence of Average Bioavailability.*J Pharmacokin Biopharm. 1987; 15(6): 657–80. doi:10.1007/BF01068419.↩︎WHO.

*Frequent deficiencies in BE study protocols.*Geneva. November 2020. Online.↩︎Potvin D, DiLiberti CE, Hauck WW, Parr AF, Schuirmann DJ, Smith RA.

*Sequential design approaches for bioequivalence studies with crossover designs.*Pharm Stat. 2008; 7(4): 245–62. doi:10.1002/pst.294.↩︎Berger RL, Hsu JC.

*Bioequivalence Trials, Intersection-Union Tests and Equivalence Confidence Sets.*Stat Sci. 1996; 11(4): 283–302. JSTOR:2246021.↩︎Zeng A.

*The TOST confidence intervals and the coverage probabilities with R simulation.*March 14, 2014. Online.↩︎It will be the same only if the Point Estimate is

*exactly*1.↩︎Certara USA, Inc. Princeton, NJ.

*Power of the two one-sided t-tests procedure.*7/9/20. Online.↩︎Certara USA, Inc. Princeton, NJ.

*Phoenix WinNonlin*. 2022. Online.↩︎I bombarded Pharsight/Certara for years asking for an update. I was a beta-tester of Phoenix v6.0–6.4. Guess who suggested which software to compare to.↩︎

Approximation by the noncentral

*t*-distribution.↩︎Skelly JP.

*A History of Biopharmaceutics in the Food and Drug Administration 1968–1993.*AAPS J. 2010; 12(1): 44–50. doi:10.1208/s12248-009-9154-8. Free Full Text.↩︎In a nutshell: Products with a large difference and small variabilty passed, whereas ones with a small difference and large variability failed. That’s counterintuitive and the opposite of what we want.↩︎

Certara USA, Inc. Princeton, NJ.

*Power for 80/20 Rule.*7/9/20. Online.↩︎