Consider allowing JavaScript. Otherwise, you have to be proficient in reading since formulas will not be rendered. Furthermore, the table of contents in the left column for navigation will not be available and code-folding not supported. Sorry for the inconvenience.
Examples in this article were generated with
4.3.1 by the package PowerTOST
.^{1}
More examples are given in the respective vignette.^{2} See also the README on GitHub for an overview and the Online manual^{3} for details.
Abbreviation | Meaning |
---|---|
\(\small{\alpha}\) | Nominal level of the test, probability of Type I Error, patient’s risk |
\(\small{\beta}\) | Probability of Type II Error, producer’s risk |
BE | Bioequivalence |
\(\small{CV}\) | Total (pooled) Coefficient of Variation |
\(\small{CV_\text{inter}}\) | Between-subject Coefficient of Variation |
\(\small{CV_\text{intra}}\) | Within-subject Coefficient of Variation |
\(\small{H_0}\) | Null hypothesis |
\(\small{H_1}\) | Alternative hypothesis (also \(\small{H_\text{a}}\)) |
NTID | Narrow Therapeutic Index Drug |
\(\small{\pi}\) | Prospective power (\(\small{1-\beta}\)) |
TOST | Two One-Sided Tests |
What are the main statistical issues in planning a confirmatory experiment?
For details about inferential statistics and hypotheses in equivalence see another article.
An ‘optimal’ study design is one, which – taking all assumptions into account – has a reasonably high chance of demonstrating equivalence (power) whilst controlling the patient’s risk.
A basic knowledge of R is
required. To run the scripts at least version 1.4.3 (2016-11-01) of
PowerTOST
is suggested. Any version of R would likely do, though the current release of
PowerTOST
was only tested with version 4.2.3 (2023-03-15)
and later.
All scripts were run on a Xeon E3-1245v3 @ 3.40GHz (1/4 cores) 16GB RAM
with R 4.3.1 on Windows 7 build 7601, Service
Pack 1, Universal C Runtime 10.0.10240.16390.
Note that in all functions of PowerTOST
the arguments
(say, the assumed T/R-ratio theta0
, the BE-limits
(theta1
, theta2
), the assumed coefficient of
variation CV
, etc.) have to be given as ratios and not in
percent.
sampleN.TOST()
assumes two equally sized groups.
Furthermore, the estimated sample size is the total number of
subjects, which is always an even number (not subjects per group – like
in some other software packages).
Most examples deal with studies where the response variables likely
follow a log-normal
distribution, i.e., we assume a multiplicative model
(ratios instead of differences). We work with \(\small{\log_{e}}\)-transformed data in
order to allow analysis by a variant of the t-test (requiring
differences). This is the default in most functions of
PowerTOST
and hence, the argument
logscale = TRUE
does not need to be specified.
It may sound picky but ‘sample size calculation’ (as used in most guidelines and alas, in some publications and textbooks) is sloppy terminology. In order to get prospective power (and hence, a sample size), we need five values:
1 – 2 are fixed by the agency,
3 is set by the sponsor (commonly to 0.80 – 0.90), and
4 – 5 are just (uncertain!) assumptions.
In other words, obtaining a sample size is not an exact calculation like \(\small{2\times2=4}\) but always just an estimation.
“Power Calculation – A guess masquerading as mathematics.
Of note, it is extremely unlikely that all assumptions will be exactly realized in a particular study. Hence, calculating retrospective (a.k.a. post hoc, a posteriori) power is not only futile but plain nonsense.^{5}
Contrary to non-inferiority trials (where we want to demonstrate superiority of verum to placebo) or first-in-human studies, in equivalence studies the allocation ratio is generally 1, i.e., we plan for equally sized treatment groups. For other approaches see the respective section.
We should not talk about between subject variability (\(\small{CV_\text{inter}}\)) in parallel designs unless we have this information from a crossover design.^{6} What we see in a parallel design is the total variability^{7} ^{8} (pooled from within-subject \(\small{s_\text{intra}^2}\) and between-subject variances \(\small{s_\text{inter}^2}\)): \[\eqalign{CV_\text{intra}&=\sqrt{\exp\left(s_\text{intra}^2\right)-1}\\ CV_\text{inter}&=\sqrt{\exp\left(\frac{s_\text{inter}^2-s_\text{intra}^2}{2}\right)-1}\\ CV_\text{total}&=\sqrt{\exp\left(\frac{s_\text{inter}^2+s_\text{intra}^2}{2}\right)-1}}\tag{1}\] The fact that we observed only one occasion in a parallel design does not mean that the within-subject variability miraculously disappears.
Let’s calculate \(\small{CV_\text{total}}\) for some combinations of \(\small{CV_\text{intra}}\) (0.075 – 0.25) and \(\small{CV_\text{inter}}\) (0.20 – 0.50).
<- function(CV) return(log(CV^2 + 1))
CV2var <- function(var) return(sqrt(exp(var) - 1))
var2CV <- c(seq(0.2, 0.4, 0.05), 0.5)
CV.b <- c(seq(0.075, 0.2, 0.025), 0.25)
CV.w <- data.frame(CV.b = rep(CV.b,
x each = length(CV.w)),
CV.w = CV.w, CV = NA)
for (i in 1:nrow(x)) {
# pool variances and convert to CV acc. to Eq. (1)
<- CV2var(x$CV.w[i]) # within-subject (intra)
MSE.w <- CV2var(x$CV.b[i]) * 2 + MSE.w # between-subject (inter)
MSE.b <- (MSE.b + MSE.w) / 2 # total
MSE.t $CV[i] <- var2CV(MSE.t)
x
}<- reshape(x, direction = "wide",
x idvar = "CV.w", timevar = "CV.b")
names(x)[2:ncol(x)] <- paste0("CV.b.", CV.b)
print(signif(x, 5), row.names = FALSE)
# CV.w CV.b.0.2 CV.b.0.25 CV.b.0.3 CV.b.0.35 CV.b.0.4 CV.b.0.5
# 0.075 0.21413 0.26168 0.31005 0.35891 0.40807 0.50698
# 0.100 0.22450 0.27042 0.31765 0.36568 0.41425 0.51235
# 0.125 0.23717 0.28125 0.32716 0.37422 0.42205 0.51916
# 0.150 0.25179 0.29395 0.33842 0.38439 0.43139 0.52738
# 0.175 0.26805 0.30828 0.35126 0.39608 0.44218 0.53692
# 0.200 0.28566 0.32404 0.36551 0.40915 0.45431 0.54772
# 0.250 0.32404 0.35904 0.39765 0.43893 0.48218 0.57282
In general \(\small{CV_\text{intra}<CV_\text{inter}}\), explaining why crossover studies are so popular. The efficiency of a crossover study compared to a parallel study is given by \(\small{\frac{\sigma_\text{intra}^2\,+\,\sigma_\text{inter}^2}{0.5\,\times\,\sigma_\text{intra}^2}}\). If, say, \(\small{\sigma_\text{intra}^2=0.5\times\sigma_\text{inter}^2}\) in a paralled study we need six times as many subjects than in a crossover to obtain the same power. On the other hand, in a crossover we have two measurements per subject, which makes the parallel study approximately three times more costly.
Studies in patients with an instable disease (e.g., in oncology) must be performed in a parallel design. If crossovers are not feasible (e.g., for drugs with a very long half life), studies could be performed in a parallel design as well.
Of note, there is no relationship between \(\small{CV_\text{intra}}\) and \(\small{CV_\text{inter}}\).
An example are drugs which are subjected to polymorphic metabolism. For
these drugs \(\small{CV_\text{intra}\ll
CV_\text{inter}}\). On the other hand, some
HVD(P)s show
\(\small{CV_\text{intra}>CV_\text{inter}}\).
It is futile to hope that \(\small{CV_\text{intra}}\) and \(\small{CV_\text{inter}}\) can be obtained from a study in a parallel design. There is an infinite number of combinations of them giving the same \(\small{CV_\text{total}}\).
Let’s simulate some arbitrary combinations (\(\small{CV_\text{intra}}\) ~7.5 – 25% and \(\small{CV_\text{inter}}\) ~20 – 50%).
<- function(CV.inter, CV.intra) {
pooling <- function(CV) return(log(CV^2 + 1))
CV2var <- function(var) return(sqrt(exp(var) - 1))
var2CV <- CV2var(CV.intra)
MSE.w <- CV2var(CV.inter) * 2 + MSE.w
MSE.b <- (MSE.b + MSE.w) / 2
MSE.t return(var2CV(MSE.t))
}set.seed(123456) # for reproducibility
<- 300L # number of simulations
nsims # CV.intra and CV.inter uniformly distributed within their ranges
<- data.frame(CV.total = NA_real_,
sims CV.intra = round(runif(n = nsims,
min = 0.075,
max = 0.25), 4),
CV.inter = round(runif(n = nsims,
min = 0.20,
max = 0.50), 4),
duplicated = 0, count = 0,
CV.ratio = NA_real_)
for (i in 1:nrow(sims)) {# do it!
$CV.total[i] <- round(pooling(sims$CV.intra[i], sims$CV.inter[i]), 4)
sims
}# For beginners: Transparent rather than elegant
# If you are are a neRd, explore the function duplicated() instead
for (i in 1:(nrow(sims) - 1)) {
<- 0
dup for (j in (i + 1):nrow(sims)) {
if (sims$CV.total[j] == sims$CV.total[i]) dup <- dup + 1
}$duplicated[i] <- dup
sims
}<- sims[with(sims, order(CV.total, CV.intra)), ]
sims <- sims$CV.total[which(sims$duplicated >= 1)]
dup.CV.total <- sims[sims$CV.total %in% dup.CV.total, c(1:3, 5)]
duplicates for (i in seq_along(dup.CV.total)) {
<- duplicates[duplicates$CV.total == dup.CV.total[i], ]
tmp <- nrow(tmp)
dups $count[which(duplicates$CV.total %in% tmp$CV.total)] <-
duplicatesrep(dups, dups)
}$CV.ratio <- duplicates$CV.intra / duplicates$CV.total
duplicatesnames(duplicates)[5] <- "intra/total"
# Increasing order by CV.total and CV.intra
<- duplicates[order(duplicates$CV.total, duplicates$CV.intra), ]
duplicates print(duplicates[c(1:3, 5)], row.names = FALSE)
# CV.total CV.intra CV.inter intra/total
# 0.2958 0.0853 0.2822 0.2883705
# 0.2958 0.1995 0.2142 0.6744422
# 0.3127 0.1389 0.2775 0.4441957
# 0.3127 0.1708 0.2582 0.5462104
# 0.3268 0.0987 0.3100 0.3020196
# 0.3268 0.2047 0.2496 0.6263770
# 0.3269 0.0752 0.3172 0.2300398
# 0.3269 0.1382 0.2935 0.4227593
# 0.3532 0.1656 0.3078 0.4688562
# 0.3532 0.2286 0.2625 0.6472254
# 0.3578 0.0976 0.3426 0.2727781
# 0.3578 0.1658 0.3128 0.4633874
# 0.3615 0.1174 0.3396 0.3247580
# 0.3615 0.1757 0.3112 0.4860304
# 0.3700 0.1082 0.3518 0.2924324
# 0.3700 0.2262 0.2856 0.6113514
# 0.3768 0.0982 0.3620 0.2606157
# 0.3768 0.2209 0.2981 0.5862527
# 0.3870 0.1040 0.3708 0.2687339
# 0.3870 0.1815 0.3363 0.4689922
# 0.3911 0.1348 0.3638 0.3446689
# 0.3911 0.1711 0.3466 0.4374840
# 0.4255 0.1155 0.4068 0.2714454
# 0.4255 0.1499 0.3938 0.3522914
# 0.4303 0.1657 0.3918 0.3850802
# 0.4303 0.2110 0.3669 0.4903556
# 0.4535 0.0851 0.4438 0.1876516
# 0.4535 0.1452 0.4252 0.3201764
# 0.4621 0.1687 0.4242 0.3650725
# 0.4621 0.2158 0.3994 0.4669985
# 0.4765 0.1435 0.4498 0.3011542
# 0.4765 0.2244 0.4102 0.4709339
# 0.4992 0.1019 0.4862 0.2041266
# 0.4992 0.1540 0.4693 0.3084936
Do you get the point? It’s like asking a pupilLeaves the pupil – rightly – dazed and confused.
As an aside, it shows also that the ‘rule of thumb’ – assuming that \(\small{CV_\text{intra}}\) in a 2×2×2 crossover study will be \(\small{\sim\tfrac{1}{2}\,CV_\text{total}}\) of a parallel study – is nonsense. In this example we obtained \(\small{\tfrac{CV_\text{intra}}{CV_\text{total}}}\) of 0.1877 – 0.6744 (\(\small{\widetilde{x}=}\) 0.3751).
The sample size cannot be
directly estimated,
only power calculated for an already given sample
size.
The power equations cannot be re-arranged to solve for sample size.
“Power. That which statisticians are always calculating but never have.
If you are a geek out in the wild and have only your scientific calculator with you, use the large sample approximation (i.e., based on the normal distribution of \(\small{\log_{e}}\)-transformed data), taking power and \(\small{\alpha}\) into account.^{10}
Convert the coefficient of variation to the variance \[\begin{equation}\tag{2} \sigma^2=\log_{e}(CV^2+1) \end{equation}\] Numbers needed in the following: \(\small{Z_{1-0.2}=0.8416212}\), \(\small{Z_{1-0.1}=1.281552}\), \(\small{Z_{1-0.05}=1.644854}\). \[\begin{equation}\tag{3a} \text{if}\;\theta_0<1:n\sim\frac{2\,\sigma^2(Z_{1+\beta}+Z_{1-\alpha})^2}{\left(\log_{e}(\theta_0)-\log_{e}(\theta_1)\right)^2} \end{equation}\] \[\begin{equation}\tag{3b} \text{if}\;\theta_0>1:n\sim\frac{2\,\sigma^2(Z_{1+\beta}+Z_{1-\alpha})^2}{\left(\log_{e}(\theta_0)-\log_{e}(\theta_2)\right)^2} \end{equation}\] \[\begin{equation}\tag{3c} \text{if}\;\theta_0=1:n\sim\frac{2\,\sigma^2(Z_{1+\beta/2}+Z_{1-\alpha})^2}{\left(\log_{e}(\theta_2)\right)^2} \end{equation}\] Generally you should use \(\small{(3\text{a})}\) or \(\small{(3\text{b})}\). Only if you are courageous, use \(\small{(3\text{c})}\). However, I would not recommend that.^{11}
\(\small{(3)}\) gives the sample size (it can be a real number) per group. We have to multiply it by two and round up to the next even integer.
In the variants of \(\small{(3)}\) we assume that the
population variance \(\small{\sigma^2}\) is known, which is never
the case. In practice is is uncertain because only its estimate
\(\small{s^2}\) is known. Therefore, \(\small{(3)}\) will
always give a sample size which is too low:
For \(\small{CV=0.40}\), \(\small{\theta_0=0.95}\), \(\small{\alpha=0.05}\), \(\small{\beta=0.20}\), and the conventional
limits we get 124.29 for the total sample size (rounded up to 126). As
we will see later, the correct sample size is 130.
“Power: That which is wielded by the priesthood of clinical trials, the statisticians, and a stick which they use to beta their colleagues.
Since we evaluate studies with a variant of the t-test, we
have to take the degrees of freedom (in a parallel design \(\small{df=n-2)}\) into account.
Power depends on the degrees of freedom, which themselves depend on the
sample size.
Hence, the power equations are used to estimate the sample size in an iterative way.
Example based on the noncentral t-distribution (warning: 100 lines of code).
<- function(n) {# equal group sizes
make.even return(as.integer(2 * (n %/% 2 + as.logical(n %% 2))))
}<- 0.40 # total (pooled) CV
CV <- 0.95 # T/R-ratio
theta0 <- 0.80 # lower BE-limit
theta1 <- 1.25 # upper BE-limit
theta2 <- 0.80 # desired (target) power
target <- 0.05 # level of the test
alpha <- 1 - target # producer's risk
beta <- data.frame(method = character(),
df iteration = integer(),
n = integer(),
power = numeric())
<- log(CV^2 + 1)
s2 <- sqrt(s2)
s <- qnorm(1 - alpha)
Z.alpha <- qnorm(1 - beta)
Z.beta .2 <- qnorm(1 - beta / 2)
Z.beta# large sample approximation
if (theta0 == 1) {
<- 2 * s2 * (Z.beta.2 + Z.alpha)^2
num <- ceiling(num / log(theta2)^2)
n.grp else {
} <- 2 * s2 * (Z.beta + Z.alpha)^2
num if (theta0 < 1) {
<- (log(theta0) - log(theta1))^2
denom else {
}<- (log(theta0) - log(theta2))^2
denom
}<- ceiling(num / denom)
n.grp
}<- make.even(2 * n.grp)
n <- pnorm(sqrt((log(theta0) - log(theta2))^2 * n.grp/
power 2 * s2)) - Z.alpha) +
(pnorm(sqrt((log(theta0) - log(theta1))^2 * n.grp/
2 * s2)) - Z.alpha) - 1
(1, 1:4] <- c("large sample", "\u2013",
df[sprintf("%.6f", power))
n, # check by central t approximation
# (since the variance is unknown)
<- qt(1 - alpha, n - 2)
t.alpha if (theta0 == 1) {
<- 2 * s2 * (Z.beta.2 + t.alpha)^2
num <- ceiling(num / log(theta2)^2)
n.grp else {
} <- 2 * s2 * (Z.beta + t.alpha)^2
num if (theta0 < 1) {
<- (log(theta0) - log(theta1))^2
denom else {
}<- (log(theta0) - log(theta2))^2
denom
}<- ceiling(num / denom)
n.grp
}<- make.even(2 * n.grp)
n <- qt(1 - alpha, n - 2)
t.alpha <- pnorm(sqrt((log(theta0) - log(theta2))^2 * n.grp/
power 2 * s2)) - t.alpha) +
(pnorm(sqrt((log(theta0) - log(theta1))^2 * n.grp/
2 * s2)) - t.alpha) - 1
(2, 1:4] <- c("central t", "\u2013",
df[sprintf("%.6f", power))
n, <- 1
i if (power < target) {# iterate upwards
repeat {
<- sqrt(4 / n) * s
sem <- c((log(theta0) - log(theta1)) / sem,
ncp log(theta0) - log(theta2)) / sem)
(<- diff(pt(c(+1, -1) * qt(1 - alpha,
power df = n - 2),
df = n - 2, ncp = ncp))
+ 2, 1:4] <- c("noncentral t", i,
df[i sprintf("%.6f", power))
n, if (power >= target) {
break
else {
}<- i + 1
i <- n + 2
n
}
}else { # iterate downwards
} repeat {
<- sqrt(4 / n) * s
sem <- c((log(theta0) - log(theta1)) / sem,
ncp log(theta0) - log(theta2)) / sem)
(<- diff(pt(c(+1, -1) * qt(1 - alpha,
power df = n - 2),
df = n - 2, ncp = ncp))
+ 2, 1:4] <- c("noncentral t", i,
df[i sprintf("%.6f", power))
n, if (power < target) {
<- df[-nrow(df), ]
df break
else {
}<- i + 1
i <- n - 2
n
}
}
}print(df, row.names = FALSE)
# method iteration n power
# large sample – 126 0.795445
# central t – 126 0.791696
# noncentral t 1 126 0.791080
# noncentral t 2 128 0.797396
# noncentral t 3 130 0.803512
Now it’s high time to leave Base R
behind and continue with
PowerTOST
.
Makes our life much easier.
library(PowerTOST) # attach it to run the examples
Its sample size functions use a modification of Zhang’s method^{13} based
on the large sample approximation as the starting value of the
iterations. You can unveil the course of iterations with
details = TRUE
.
sampleN.TOST(CV = 0.40, theta0 = 0.95,
targetpower = 0.80,
design = "parallel", details = TRUE)
#
# +++++++++++ Equivalence test - TOST +++++++++++
# Sample size estimation
# -----------------------------------------------
# Study design: 2 parallel groups
# Design characteristics:
# df = n-2, design const. = 4, step = 2
#
# log-transformed data (multiplicative model)
#
# alpha = 0.05, target power = 0.8
# BE margins = 0.8 ... 1.25
# True ratio = 0.95, CV = 0.4
#
# Sample size search (ntotal)
# n power
# 128 0.797396
# 130 0.803512
#
# Exact power calculation with
# Owen's Q functions.
Only two iterations. In many cases it hits the bull’s eye right away, i.e., already with the first guess.
Never (ever!) estimate the sample size based on the large
sample approximation \(\small{(3)}\).
We evaluate our studies based on a variant of the t-test,
right?
The sample size based on the normal distribution may be too small, thus
compromising power.
# method n Delta RE power
# large sample approx. (2) 126 -4 -3.1% 0.791080
# noncentral t-distribution 130 ±0 ±0.0% 0.803512
# exact method (Owen’s Q) 130 – – 0.803512
Throughout the examples by ‘group’ I’m referring to ‘treatment arms’ – not multiple groups within them or multicenter studies. That’s another cup of tea.
Most methods of PowerTOST
are based on pairwise
comparisons. It is up to you to adjust the level of the test
alpha
if you want to compare more (say, two test treatments
v.s. a reference or each of them against one of the others) in
order to avoid inflation of the family-wise
error rate due to multiplicity. Examples are given further down.
We assume a CV of 0.40, a T/R-ratio of 0.95, and target a power of 0.80.
Since theta0 = 0.95
and targetpower = 0.80
are defaults of the function, we don’t have to give them explicitly. As
usual in bioequivalence, alpha = 0.05
is employed (we will
assess the study by a \(\small{100\,(1-2\,\alpha)=90\%}\)
confidence interval) and the
BE-limits are
theta1 = 0.80
and theta2 = 1.25
. Since they
are also defaults of the function, we don’t have to give them as well.
Hence, the call of the function reduces to a one-liner.
sampleN.TOST(CV = 0.40, design = "parallel")
#
# +++++++++++ Equivalence test - TOST +++++++++++
# Sample size estimation
# -----------------------------------------------
# Study design: 2 parallel groups
# log-transformed data (multiplicative model)
#
# alpha = 0.05, target power = 0.8
# BE margins = 0.8 ... 1.25
# True ratio = 0.95, CV = 0.4
#
# Sample size (total)
# n power
# 130 0.803512
Sometimes we are not interested in the entire output and want to use
only a part of the results in subsequent calculations. We can suppress
the output by the argument print = FALSE
and assign the
result to a data frame (here named x
).
<- sampleN.TOST(CV = 0.40, design = "parallel", print = FALSE) x
Although you could access the elements by the number of the column(s), I don’t recommend that since in various functions these numbers are different and hence, difficult to remember.
Let’s retrieve the column names of x
:
names(x)
# [1] "Design" "alpha" "CV"
# [4] "theta0" "theta1" "theta2"
# [7] "Sample size" "Achieved power" "Target power"
Now we can access the elements of x
by their names. Note
that double square brackets [[…]]
have to be used.
"Sample size"]]
x[[# [1] 130
"Achieved power"]]
x[[# [1] 0.803512
Although you could access the elements by the number of the
column(s), I don’t recommend that, since in various other functions of
PowerTOST
these numbers are different and hence, difficult
to remember. If you insist in accessing elements by column-numbers, use
single square brackets […]
. You can also use a vector.
c(1, 7:8)]
x[# Design Sample size Achieved power
# 1 parallel 130 0.803512
With 130 subjects (65 per group) we achieve the power we desire.
What happens if we have one dropout?
power.TOST(CV = 0.40, design = "parallel",
n = x[["Sample size"]] - 1)
# Unbalanced design. n(i)=65/64 assumed.
# [1] 0.8004552
Still above the 0.80 we desire. However, with two dropouts (128 eligible subjects) we would slightly fall short (0.7974).
Interlude:
CV
/ theta0
-combinations
Sometimes we are interested in combinations of assumed values. Since
sampleN.TOST()
is not vectorized, a gimmick:
<- function(CVs, theta0s, ...) {
sampleN.TOST.vectorized # Returns a list with two matrices (n and power),
# rows theta0 and columns CV
<- power <- matrix(ncol = length(CVs), nrow = length(theta0s))
n for (i in seq_along(theta0s)) {
for (j in seq_along(CVs)) {
<- sampleN.TOST(CV = CVs[j], theta0 = theta0s[i], ...)
tmp <- tmp[["Sample size"]]
n[i, j] <- tmp[["Achieved power"]]
power[i, j]
}
}# cosmetics
<- function(x) match(TRUE, round(x, 1:15) == x)
DecPlaces <- paste0("CV=%.", max(sapply(CVs, FUN = DecPlaces),
fmt.col na.rm = TRUE), "f")
<- paste0("theta0=%.", max(sapply(theta0s, FUN = DecPlaces),
fmt.row na.rm = TRUE), "f")
colnames(power) <- colnames(n) <- sprintf(fmt.col, CVs)
rownames(power) <- rownames(n) <- sprintf(fmt.row, theta0s)
<- list(n = n, power = power)
res return(res)
}
<- sampleN.TOST.vectorized(CV = seq(0.30, 0.50, 0.05),
x theta0 = seq(0.90, 0.95, 0.01),
design = "parallel", print = FALSE)
cat("Sample size\n")
print(x$n)
cat("\nAchieved power\n")
print(signif(x$power, 5))
# Sample size
# CV=0.30 CV=0.35 CV=0.40 CV=0.45 CV=0.50
# theta0=0.90 156 208 266 332 400
# theta0=0.91 130 174 224 278 334
# theta0=0.92 112 148 190 236 284
# theta0=0.93 96 128 164 204 246
# theta0=0.94 86 114 144 180 216
# theta0=0.95 76 102 130 160 194
#
# Achieved power
# CV=0.30 CV=0.35 CV=0.40 CV=0.45 CV=0.50
# theta0=0.90 0.80227 0.80107 0.80008 0.80202 0.80075
# theta0=0.91 0.80060 0.80091 0.80234 0.80237 0.80017
# theta0=0.92 0.80473 0.80071 0.80127 0.80168 0.80009
# theta0=0.93 0.80171 0.80101 0.80100 0.80200 0.80119
# theta0=0.94 0.80866 0.80599 0.80088 0.80380 0.80122
# theta0=0.95 0.80312 0.80533 0.80351 0.80063 0.80200
Since dropouts are common (especially in studies in patients) it makes sense to include / dose more subjects in order to end up with a number of eligible subjects which is not lower than our initial estimate.
Let’s explore that in the next section.
We define two supportive functions:
n
will be rounded up to the next even number.<- function(n) {
up2even return(as.integer(2 * (n %/% 2 + as.logical(n %% 2))))
}
n
, and the anticipated dropout-rate
do.rate
.<- function(n, do.rate) {
nadj return(as.integer(up2even(n / (1 - do.rate))))
}
In order to come up with a suggestion we have to anticipate a (realistic!) dropout rate. Note that this not the job of the statistician; ask the Principal Investigator.
“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.
The dropout-rate is calculated from the eligible and
dosed subjects
or simply \[\begin{equation}\tag{4}
do.rate=1-n_\text{eligible}/n_\text{dosed}
\end{equation}\] Of course, we know it only after the
study was performed.
By substituting \(n_\text{eligible}\) with the estimated sample size \(n\), providing an anticipated dropout-rate and rearrangement to find the adjusted number of dosed subjects \(n_\text{adj}\) we should use \[\begin{equation}\tag{5} n_\text{adj}=\;\upharpoonleft n\,/\,(1-do.rate) \end{equation}\] where \(\upharpoonleft\) denotes rounding up to the next even number as implemented in the functions above.
An all too common mistake is to increase the estimated sample size \(n\) by the dropout-rate according to \[\begin{equation}\tag{6} n_\text{adj}=\;\upharpoonleft n\times(1+do.rate) \end{equation}\]
If you used \(\small{(6)}\) in the past – you are not alone. In a small survey a whopping 29% of respondents reported to use it.^{15} Consider changing your routine.
“There are no routine statistical questions, only questionable statistical routines.
In the following I specified more arguments to make the function more
flexible.
Note that I wrapped the function power.TOST()
in
suppressMessages()
. Otherwise, the function will throw for
any odd sample size a message telling us that the design is
unbalanced. Well, we know that.
<- "parallel"
design <- 0.40 # total (pooled) for parallel design
CV <- 0.80 # target (desired) power
target <- 0.95 # assumed T/R-ratio
theta0 <- 0.80 # lower BE limit
theta1 <- 1.25 # upper BE limit
theta2 <- 0.10 # anticipated dropout-rate 10%
do.rate <- sampleN.TOST(CV = CV, theta0 = theta0,
df theta1 = theta1,
theta2 = theta2,
targetpower = target,
design = design,
print = FALSE)
# calculate the adjusted sample size
<- nadj(df[["Sample size"]], do.rate)
n.adj # (decreasing) vector of eligible subjects
<- n.adj:df[["Sample size"]]
n.elig <- paste0("Design : ",
info
design,"\nAssumed CV : ",
CV,"\nAssumed T/R ratio : ",
theta0,"\nBE limits : ",
paste(theta1, theta2, sep = "\u2026"),
"\nTarget (desired) power : ",
target,"\nAnticipated dropout-rate: ",
do.rate,"\nEstimated sample size : ",
"Sample size"]], " (",
df[["Sample size"]]/2, "/group)",
df[["\nAchieved power : ",
signif(df[["Achieved power"]], 4),
"\nAdjusted sample size : ",
" (", n.adj/2, "/group)",
n.adj, "\n\n")
# explore the potential outcome for
# an increasing number of dropouts
<- signif((n.adj - n.elig) / n.adj, 4)
do.act <- data.frame(dosed = n.adj,
df eligible = n.elig,
dropouts = n.adj - n.elig,
do.act = do.act,
power = NA)
for (i in 1:nrow(df)) {
$power[i] <- suppressMessages(
dfpower.TOST(CV = CV,
theta0 = theta0,
theta1 = theta1,
theta2 = theta2,
design = design,
n = df$eligible[i]))
}cat(info)
print(round(df, 4), row.names = FALSE)
# Design : parallel
# Assumed CV : 0.4
# Assumed T/R ratio : 0.95
# BE limits : 0.8…1.25
# Target (desired) power : 0.8
# Anticipated dropout-rate: 0.1
# Estimated sample size : 130 (65/group)
# Achieved power : 0.8035
# Adjusted sample size : 146 (73/group)
#
# dosed eligible dropouts do.act power
# 146 146 0 0.0000 0.8461
# 146 145 1 0.0068 0.8437
# 146 144 2 0.0137 0.8413
# 146 143 3 0.0205 0.8389
# 146 142 4 0.0274 0.8364
# 146 141 5 0.0343 0.8339
# 146 140 6 0.0411 0.8313
# 146 139 7 0.0480 0.8287
# 146 138 8 0.0548 0.8261
# 146 137 9 0.0616 0.8234
# 146 136 10 0.0685 0.8207
# 146 135 11 0.0753 0.8180
# 146 134 12 0.0822 0.8152
# 146 133 13 0.0890 0.8123
# 146 132 14 0.0959 0.8094
# 146 131 15 0.1027 0.8065
# 146 130 16 0.1096 0.8035
In the worst case (16 dropouts) we end up with the originally estimated sample size of 130. Power preserved, mission accomplished. If we have fewer dropouts, splendid – we gain power.
If we would have adjusted the sample size acc. to \(\small{(6)}\) we
would have dosed only 144 subjects instead of 146 acc. to \(\small{(5)}\).
If the anticipated dropout rate of 10% is realized in the study, we
would have only 129 eligible subjects and power will be slightly
compromised (0.8005 instead of 0.8035). In this example we achieve still
more than our target power but the loss might be relevant in other
cases.
As said in the preliminaries, calculating post hoc power is futile.
“There is simple intuition behind results like these: If my car made it to the top of the hill, then it is powerful enough to climb that hill; if it didn’t, then it obviously isn’t powerful enough. Retrospective power is an obvious answer to a rather uninteresting question. A more meaningful question is to ask whether the car is powerful enough to climb a particular hill never climbed before; or whether a different car can climb that new hill. Such questions are prospective, not retrospective.
However, sometimes we are interested in it for planning the next study.
If you give and odd total sample size n
,
power.TOST()
will try to keep groups as balanced as
possible and show in a message how that was done.
<- 131
n.act signif(power.TOST(CV = 0.40, design = "parallel",
n = n.act), 6)
# Unbalanced design. n(i)=66/65 assumed.
# [1] 0.806475
Say, our study was more unbalanced. Let us assume that we dosed 146
subjects, the total number of subjects was also 131 but all dropouts
occurred in one group (unlikely but possible).
Instead of the total sample size n
we can give the number
of subjects of each group as a vector (the order is not relevant,
i.e., it does not matter which element refers to the test or
reference group).
<- 146
n.adj <- 131
n.act <- n.adj / 2
n.g1 <- n.act - n.g1
n.g2 <- signif(power.TOST(CV = 0.40, design = "parallel",
post.hoc n = c(n.g1, n.g2)), 6)
<- "%3.0f (%2.0f dropouts)"
fmt cat(paste0("Dosed subjects: ", n.adj,
"\nEligible : ",
sprintf(fmt, n.act, n.adj - n.act),
"\n Group 1 : ",
sprintf(fmt, n.g1, n.adj / 2 - n.g1),
"\n Group 2 : ",
sprintf(fmt, n.g2, n.adj / 2 - n.g2),
"\nPower : ", post.hoc, "\n"))
# Dosed subjects: 146
# Eligible : 131 (15 dropouts)
# Group 1 : 73 ( 0 dropouts)
# Group 2 : 58 (15 dropouts)
# Power : 0.801396
Of course, in a particular study you will provide the numbers in the
n
vector directly.
As stated already in the Introduction, the
CV and the T/R-ratio are only assumptions. Whatever
their origin might be (literature, previous studies) they carry some
degree of uncertainty. Hence, believing^{17} that they are the
true ones may be risky.
Some statisticians call that the ‘Carved-in-Stone’ approach.
Say, we performed a pilot study in 64 subjects and estimated the CV as 0.40.
The \(\small{\alpha}\) confidence interval of the CV is obtained via the \(\small{\chi^2}\)-distribution of its error variance \(\sigma^2\) with \(\small{n-2}\) degrees of freedom. \[\eqalign{\tag{7} s^2&=\log_{e}(CV^2+1)\\ L=\frac{(n-1)\,s^2}{\chi_{\alpha/2,\,n-2}^{2}}&\leq\sigma^2\leq\frac{(n-1)\,s^2}{\chi_{1-\alpha/2,\,n-2}^{2}}=U\\ \left\{lower\;CL,\;upper\;CL\right\}&=\left\{\sqrt{\exp(L)-1},\sqrt{\exp(U)-1}\right\} }\] Let’s calculate the 95% confidence interval of the CV to get an idea.
<- 64 # pilot study
m <- CVCL(CV = 0.40, df = m - 2,
x side = "2-sided", alpha = 0.05)
signif(x, 4)
# lower CL upper CL
# 0.3368 0.4941
Surprised? Although 0.40 is the best estimate for planning the next study, there is no guarantee that we will get exactly the same outcome. Since the \(\small{\chi^2}\)-distribution is skewed to the right, it is more likely that we will face a higher CV than a lower one in the pivotal study.
If we plan the study based on 0.40, we would opt for 130 subjects
like in the examples before (not adjusted for the dropout-rate).
If the CV will be lower, we gain power. Fine.
But what if it will be higher? Of course, we will loose power. But how
much?
Let’s explore what might happen at the confidence limits of the CV.
<- 64
m <- CVCL(CV = 0.40, df = m - 2,
x side = "2-sided", alpha = 0.05)
<- 130
n <- data.frame(CV = c(x[["lower CL"]], 0.40,
comp "upper CL"]]),
x[[power = NA)
for (i in 1:nrow(comp)) {
$power[i] <- power.TOST(CV = comp$CV[i],
compdesign = "parallel",
n = n)
}1] <- signif(comp[, 1], 4)
comp[, 2] <- signif(comp[, 2], 6)
comp[, print(comp, row.names = FALSE)
# CV power
# 0.3368 0.906998
# 0.4000 0.803512
# 0.4941 0.624098
Ouch!
What can we do? The larger the previous study was, the larger the degrees of freedom and hence, the narrower the confidence interval. In simple terms: The estimate is more certain. On the other hand, it also means that very small pilot studies are practically useless.
<- seq(8, 64, 8)
m <- data.frame(n = m, CV = 0.40,
x l = NA_real_, u = NA_real_)
for (i in 1:nrow(x)) {
3:4] <- CVCL(CV = 0.40, df = m[i] - 2,
x[i, side = "2-sided", alpha = 0.05)
}3:4] <- signif(x[, 3:4], 4)
x[, names(x)[3:4] <- c("lower CL", "upper CL")
print(x, row.names = FALSE)
# n CV lower CL upper CL
# 8 0.4 0.2521 1.0270
# 16 0.4 0.2878 0.6682
# 24 0.4 0.3047 0.5884
# 32 0.4 0.3153 0.5511
# 40 0.4 0.3228 0.5287
# 48 0.4 0.3285 0.5136
# 56 0.4 0.3330 0.5026
# 64 0.4 0.3368 0.4941
A Bayesian
method is implemented in PowerTOST
, which takes the
uncertainty of estimates into account. We can explore the uncertainty of
the CV or the T/R-ratio and of both.
<- 64 # sample size of pilot
m <- 0.40
CV <- 0.95
theta0 <- "parallel"
design <- expsampleN.TOST(CV = 0.4, theta0 = theta0,
x targetpower = 0.80,
design = design,
prior.parm = list(m = m,
design = design),
prior.type = "CV",
details = FALSE)
#
# ++++++++++++ Equivalence test - TOST ++++++++++++
# Sample size est. with uncertain CV
# -------------------------------------------------
# Study design: 2 parallel groups
# log-transformed data (multiplicative model)
#
# alpha = 0.05, target power = 0.8
# BE margins = 0.8 ... 1.25
# Ratio = 0.95
# CV = 0.4 with 62 df
#
# Sample size (ntotal)
# n exp. power
# 134 0.802364
Doesn’t really matter. Only 3% more subjects required. What about an uncertain T/R-ratio?
<- 64
m <- 0.40
CV <- 0.95
theta0 <- "parallel"
design <- expsampleN.TOST(CV = 0.4, theta0 = theta0,
x targetpower = 0.80,
design = design,
prior.parm = list(m = m,
design = design),
prior.type = "theta0",
details = FALSE)
#
# ++++++++++++ Equivalence test - TOST ++++++++++++
# Sample size est. with uncertain theta0
# -------------------------------------------------
# Study design: 2 parallel groups
# log-transformed data (multiplicative model)
#
# alpha = 0.05, target power = 0.8
# BE margins = 0.8 ... 1.25
# Ratio = 0.95
# CV = 0.4
#
# Sample size (ntotal)
# n exp. power
# 314 0.800778
Just terrible. The sample size more than doubles.
That should not take us by surprise. We don’t know where the true T/R-ratio lies but we can calculate the lower 95% confidence limit of the pilot study’s point estimate to get an idea about a worst case.
<- 64
m <- 0.40
CV <- 0.95
pe <- "parallel"
design <- round(CI.BE(CV = CV, pe = 0.95, n = m,
ci design = design), 4)
if (pe <= 1) {
<- ci[["lower"]]
cl else {
} <- ci[["upper"]]
cl
}print(cl)
# [1] 0.8089
Explore the impact of a relatively 5% higher CV and a relatively 5% lower T/R-ratio on power for a given sample size.
<- 130
n <- 0.40
CV <- 0.95
theta0 <- "parallel"
design <- data.frame(CV = c(CV, CV*1.05),
comp1 power = NA_real_)
<- data.frame(theta0 = c(theta0, theta0*0.95),
comp2 power = NA_real_)
for (i in 1:2) {
$power[i] <- power.TOST(CV = comp1$CV[i],
comp1theta0 = theta0,
n = n, design = design)
}$power <- signif(comp1$power, 6)
comp1for (i in 1:2) {
$power[i] <- power.TOST(CV = CV,
comp2theta0 = comp2$theta0[i],
n = n, design = design)
}$power <- signif(comp2$power, 6)
comp2print(comp1, row.names = FALSE)
print(comp2, row.names = FALSE)
# CV power
# 0.40 0.803512
# 0.42 0.766869
# theta0 power
# 0.9500 0.803512
# 0.9025 0.550772
Interlude: Power curves
Note the log-scale of the x-axis. It demonstrates that power curves are symmetrical around 1 (\(\small{\log_{e}(1)=0}\), where \(\small{\log_{e}(\theta_2)=\left|\log_{e}(\theta_1)\right|}\)) and we will achieve the same power for \(\small{\theta_0}\) and \(\small{1/\theta_0}\) (e.g., for 0.95 and 1.0526). Furthermore, all curves intersect 0.05 at \(\small{\theta_1}\) and \(\small{\theta_2}\), which means that for a true \(\small{\theta_0}\) at one of the limits \(\small{\alpha}\) is strictly controlled.
<nitpick><- 0.40
CV <- 0.05 # direction unknown
delta <- c(1 - delta, 1 / (1 + delta),
theta0s 1 + delta, 1 / (1 - delta))
<- sampleN.TOST(CV = CV, theta0 = 1 - delta,
n design = "parallel",
print = FALSE)[["Sample size"]]
<- data.frame(CV = CV, theta0 = theta0s,
comp1 base = c(TRUE, rep(FALSE, 3)),
n = n, power = NA_real_)
for (i in 1:nrow(comp1)) {
$power[i] <- power.TOST(CV = CV,
comp1theta0 = comp1$theta0[i],
design = "parallel",
n = n)
}<- sampleN.TOST(CV = CV, theta0 = 1 + delta,
n design = "parallel",
print = FALSE)[["Sample size"]]
<- data.frame(CV = CV, theta0 = theta0s,
comp2 base = c(FALSE, FALSE, TRUE, FALSE),
n = n, power = NA_real_)
for (i in 1:nrow(comp2)) {
$power[i] <- power.TOST(CV = CV,
comp2theta0 = comp2$theta0[i],
design = "parallel",
n = n)
}c(2, 5)] <- signif(comp1[, c(2, 5)] , 4)
comp1[, c(2, 5)] <- signif(comp2[, c(2, 5)] , 4)
comp2[, print(comp1, row.names = F); print(comp2, row.names = F)
# CV theta0 base n power
# 0.4 0.9500 TRUE 130 0.8035
# 0.4 0.9524 FALSE 130 0.8124
# 0.4 1.0500 FALSE 130 0.8124
# 0.4 1.0530 FALSE 130 0.8035
# CV theta0 base n power
# 0.4 0.9500 FALSE 126 0.7911
# 0.4 0.9524 FALSE 126 0.8001
# 0.4 1.0500 TRUE 126 0.8001
# 0.4 1.0530 FALSE 126 0.7911
</nitpick>
Since power is much more sensitive to the T/R-ratio than to the CV, the results obtained with the Bayesian method should be clear now.
Essentially this leads to the murky waters of prospective power and sensitivity analyses,
which are covered in other articles.
An appetizer to show the maximum deviations (CV, T/R-ratio and
decreased sample size due to dropouts) which give still a minimum
acceptable power of ≥ 0.70:
<- 0.40
CV <- 0.95
theta0 <- "parallel"
design <- 0.80
target <- 0.70
minpower <- pa.ABE(CV = CV, theta0 = theta0,
pa design = design,
targetpower = target,
minpower = minpower)
<- 100*(tail(pa$paCV[["CV"]], 1) -
change.CV $plan$CV) / pa$plan$CV
pa<- 100*(head(pa$paGMR$theta0, 1) -
change.theta0 $plan$theta0) /
pa$plan$theta0
pa<- 100*(tail(pa$paN[["N"]], 1) -
change.n $plan[["Sample size"]]) /
pa$plan[["Sample size"]]
pa<- data.frame(parameter = c("CV", "theta0", "n"),
comp change = c(change.CV,
change.theta0,
change.n))$change <- sprintf("%+.2f%%", comp$change)
compnames(comp)[2] <- "relative change"
print(pa, plotit = FALSE)
print(comp, row.names = FALSE)
# Sample size plan ABE
# Design alpha CV theta0 theta1 theta2 Sample size
# parallel 0.05 0.4 0.95 0.8 1.25 130
# Achieved power
# 0.803512
#
# Power analysis
# CV, theta0 and number of subjects leading to min. acceptable power of ~0.7:
# CV= 0.4552, theta0= 0.9276
# n = 103 (power= 0.701)
#
# parameter relative change
# CV +13.79%
# theta0 -2.36%
# n -20.77%
Confirms what we have seen above. Interesting that the sample size is the least sensitive one. Many overrate the impact of dropouts on power.
If you didn’t stop reading in desperation (understandable), explore both uncertain CV and T/R-ratio:
<- 64
m <- 0.40
CV <- 0.95
theta0 <- "parallel"
design <- expsampleN.TOST(CV = 0.4, theta0 = theta0,
df targetpower = 0.80,
design = design,
prior.parm = list(
m = m, design = design),
prior.type = "both",
details = FALSE)
#
# ++++++++++++ Equivalence test - TOST ++++++++++++
# Sample size est. with uncertain CV and theta0
# -------------------------------------------------
# Study design: 2 parallel groups
# log-transformed data (multiplicative model)
#
# alpha = 0.05, target power = 0.8
# BE margins = 0.8 ... 1.25
# Ratio = 0.95 with 62 df
# CV = 0.4 with 62 df
#
# Sample size (ntotal)
# n exp. power
# 332 0.800366
I don’t suggest that you should use it in practice. AFAIK, not even the main author of this function (Benjamin Lang) does. However, it serves educational purposes to show that it is not that easy and why even properly planned studies might fail.
An alternative to assuming a T/R-ratio is based on statistical
assurance.^{18} This concept uses the distribution of
T/R-ratios and assumes an uncertainty parameter \(\small{\sigma_\text{u}}\). A natural
assumption is \(\small{\sigma_\text{u}=1-\theta_0}\),
i.e., for the commonly applied T/R-ratio of 0.95 one can use
the argument sem = 0.05
in the function
expsampleN.TOST()
, where the argument theta0
has to be fixed at 1.
<- 0.40
CV <- 0.95
theta0 <- 0.80
target <- "parallel"
design <- 1 - theta0
sigma.u <- data.frame(theta0 = theta0,
comp n.1 = NA, power = NA,
sigma.u = sigma.u,
n.2 = NA, assurance = NA)
2:3] <- sampleN.TOST(CV = CV,
comp[targetpower = target,
theta0 = theta0,
design = design,
details = FALSE,
print = FALSE)[7:8]
5:6] <- expsampleN.TOST(CV = CV,
comp[theta0 = 1, # fixed!
targetpower = target,
design = design,
prior.type = "theta0",
prior.parm =
list(sem = sigma.u),
details = FALSE,
print = FALSE)[9:10]
names(comp)[c(2, 5)] <- rep("n", 2)
print(signif(comp, 6), row.names = FALSE)
# theta0 n power sigma.u n assurance
# 0.95 130 0.803512 0.05 126 0.80161
Out of the box sampleN.TOST()
gives equally sized groups
(treatment arms), i.e., based on an allocation ratio of 1. If
we directly use this sample size for another allocation ratio, power
will be compromised. However, we can correct for that.
An example for a desired T:R allocation of 2:1.
<- function(n, x) {
round.up return(as.integer(x * (n %/% x + as.logical(n %% x))))
}<- 0.40 # total (pooled) CV
CV <- 0.95 # assumed T/R-ratio
theta0 <- 0.80 # target (desired) power
target <- 2 # allocation of T-arm
a.T <- 1 # allocation of R-arm
a.R <- "parallel"
design # 1:1
<- sampleN.TOST(CV = CV, theta0 = theta0, design = design,
tmp targetpower = target, print = FALSE)
.0 <- as.integer(tmp[["Sample size"]])
n.0 <- tmp[["Achieved power"]]
pwr# 2:1 (naïve)
.1 <- setNames(c(round.up(n.0/(a.T+a.R)*a.T, a.T),
nround.up(n.0/(a.T+a.R)*a.R, a.R)),
c("Test", "Reference"))
.1 <- power.TOST(CV = CV, theta0 = theta0, n = n.1, design = design)
pwr# 2:1 (preserving power)
.2 <- n.1
nrepeat {# increase the sample size if necessary
.2 <- power.TOST(CV = CV, theta0 = theta0, n = n.2, design = design)
pwrif (pwr.2 >= target) break
.2[["Test"]] <- as.integer(n.2[["Test"]] + a.T)
n.2[["Reference"]] <- as.integer(n.2[["Reference"]] + a.R)
n
}cat("n =", n.0, "(1:1 allocation)",
"\n nT = nR =", n.0/2,
"\n power =", sprintf("%.4f", pwr.0),
"\nn =", sum(n.1), "(naïve", paste0(a.T, ":", a.R, " allocation)"),
paste0("\n nT = ", n.1[["Test"]], ", nR = ", n.1[["Reference"]]),
"\n power =",sprintf("%.4f", pwr.1),
"\nn =", sum(n.2), paste0("(", a.T, ":", a.R, " allocation)"),
paste0("\n nT = ", n.2[["Test"]], ", nR = ", n.2[["Reference"]]),
"\n power =", sprintf("%.4f", pwr.2), "\n")
# n = 130 (1:1 allocation)
# nT = nR = 65
# power = 0.8035
# n = 132 (naïve 2:1 allocation)
# nT = 88, nR = 44
# power = 0.7618
# n = 147 (2:1 allocation)
# nT = 98, nR = 49
# power = 0.8060
In order to preserve power, we need 13% more subjects than for the 1:1 allocation.
It is not unusual that equivalence of more than one endpoint has to be demonstrated. In bioequivalence the pharmacokinetic metrics C_{max} and AUC_{0–t} are mandatory (in some jurisdictions like the FDA additionally AUC_{0–∞}).
We don’t have to worry about multiplicity issues (inflated Type I Error) since if all tests must pass at level \(\small{\alpha}\), we are protected by the intersection-union principle.^{19} ^{20} We design the study always for the worst case combination, i.e., based on the PK metric requiring the largest sample size. In some jurisdictions wider BE limits for C_{max} are acceptable. Let’s explore that with different CVs and T/R-ratios.
<- c("Cmax", "AUCt", "AUCinf")
metrics <- c(0.40, 0.30, 0.33)
CV <- c(0.95, 1.07, 1.06)
theta0 <- c(0.75, 0.80, 0.80)
theta1 <- 1 / theta1
theta2 <- "parallel"
design <- 0.80
target <- data.frame(metric = metrics,
x theta1 = theta1, theta2 = theta2,
CV = CV, theta0 = theta0,
n = NA_integer_)
for (i in 1:nrow(x)) {
$n[i] <- sampleN.TOST(CV = CV[i],
xtheta0 = theta0[i],
theta1 = theta1[i],
theta2 = theta2[i],
targetpower = target,
design = design,
print = FALSE)[["Sample size"]]
}$theta1 <- sprintf("%.4f", x$theta1)
x$theta2 <- sprintf("%.4f", x$theta2)
x<- paste0("Sample size based on ",
txt $metric[x$n == max(x$n)], ".\n")
xprint(x, row.names = FALSE)
cat(txt)
# metric theta1 theta2 CV theta0 n
# Cmax 0.7500 1.3333 0.40 0.95 72
# AUCt 0.8000 1.2500 0.30 1.07 90
# AUCinf 0.8000 1.2500 0.33 1.06 98
# Sample size based on AUCinf.
Even if we assume the same T/R-ratio for two PK metrics, we will get a wider margin for the one with lower variability.
Let’s continue with the conditions of our previous examples, this time assuming that the CV and T/R-ratio were applicable for C_{max}. As common in PK, the CV of AUC is lower, say only 0.30. That means, the study is ‘overpowered’ for the assumed T/R-ratio of AUC.
Which are the extreme T/R-ratios (largest deviations of T from R) giving still the target power?
<- function(y) {
opt power.TOST(theta0 = y, CV = x$CV[2],
theta1 = theta1,
theta2 = theta2,
n = x$n[1],
design = design) - target
}<- c("Cmax", "AUC")
metrics <- c(0.40, 0.30) # Cmax, AUC
CV <- 0.95 # both metrics
theta0 <- 0.80
theta1 <- 1.25
theta2 <- "parallel"
design <- 0.80
target <- data.frame(metric = metrics, theta0 = theta0,
x CV = CV, n = NA_integer_,
power = NA_real_)
for (i in 1:nrow(x)) {
4:5] <- sampleN.TOST(CV = CV[i],
x[i, theta0 = theta0,
theta1 = theta1,
theta2 = theta2,
targetpower = target,
design = design,
print = FALSE)[7:8]
}$power <- signif(x$power, 6)
xif (theta0 < 1) {
<- uniroot(opt, tol = 1e-8,
res interval = c(theta1 + 1e-4, theta0))
else {
} <- uniroot(opt, tol = 1e-8,
res interval = c(theta0, theta2 - 1e-4))
}<- unlist(res)
res <- c(res[["root"]], 1/res[["root"]])
theta0s <- paste0("Target power for ", metrics[2],
txt " and sample size ",
$n[1], "\nachieved for theta0 ",
xsprintf("%.4f", theta0s[1]), " or ",
sprintf("%.4f", theta0s[2]), ".\n")
print(x, row.names = FALSE)
cat(txt)
# metric theta0 CV n power
# Cmax 0.95 0.4 130 0.803512
# AUC 0.95 0.3 76 0.803123
# Target power for AUC and sample size 130
# achieved for theta0 0.9099 or 1.0990.
That means, although we assumed for AUC the same T/R-ratio as for C_{max} – with the sample size of 130 required for C_{max} – for AUC it can be as low as ~0.91 or as high as ~1.10, which is a soothing side-effect.
Furthermore, sometimes we have less data of AUC than of C_{max} (samples at the end of the profile missing or unreliable \(\small{\widehat{\lambda}_\text{z}}\)$ in some subjects and therefore, less data of AUC_{0–∞} than of AUC_{0–t}). Again, it will not hurt because for the originally assumed T/R-ratio we need only 76 subjects.
Since – as a one-point metric – C_{max} is
inherently more variable than AUC, Health Canada does not
require assessment of its Confidence Interval. Only the point estimate
has to lie within 80.0 – 125.0%. We can explore that by setting
alpha = 0.5
for it.^{21}
<- c("Cmax", "AUC")
metrics <- c(1.00, 0.40)
CV <- 0.95
theta0 <- "parallel"
design <- 0.80
target <- c(0.50, 0.05)
alpha <- data.frame(metric = metrics, CV = CV, theta0 = theta0,
x alpha = alpha, n = NA_integer_)
for (i in 1:nrow(x)) {
$n[i] <- sampleN.TOST(alpha = alpha[i],
xCV = CV[i],
theta0 = theta0,
targetpower = target,
design = design,
print = FALSE)[["Sample size"]]
}<- paste0("Sample size based on ",
txt $metric[x$n == max(x$n)], ".\n")
xprint(x, row.names = FALSE)
cat(txt)
# metric CV theta0 alpha n
# Cmax 1.0 0.95 0.50 102
# AUC 0.4 0.95 0.05 130
# Sample size based on AUC.
Note the extremely high CV. You will hardly ever face a situation where you have to base the sample size on C_{max}.
Say, the number of test treatments in a pivotal study is \(\small{k}\). Then^{22}
Say, we have two test treatments and aim to show bioequivalence for
both of them to the reference treatment (case a). Note the argument alpha
in the
function sampleN.TOST()
:
<- 0.05 # nominal (overall) level
alpha0 <- 0.40 # assumed total CV
CV <- 0.95 # assumed T/R-ratio(s)
theta0 <- 0.80 # target (desired) power
target <- 2 # number of tests
k # Note that a formal sample size estimation is not required in a pilot study;
# only in a pivotal study!
<- FALSE # all tests should pass
adj if (!adj) { # unadjusted
<- alpha0
alpha else { # Bonferroni
} <- alpha0 / k
alpha
}# Total sample size and power for two groups
<- sampleN.TOST(alpha = alpha,
tmp CV = CV,
theta0 = theta0,
targetpower = target,
design = "parallel",
print = FALSE)
<- tmp[["Sample size"]]
N <- tmp[["Achieved power"]]
pwr # Total sample size for k + 1 groups
<- N * (k + 1) / 2
n <- paste0("Tests : ", k,
txt "\nGroups : ", k + 1)
if (adj) {
<- paste0(txt, "\nalpha : ", alpha0,
txt "\nMultiplicity adjustment : Bonferroni",
"\nAdjusted alpha : ", signif(alpha, 6))
else {
} <- paste0(txt, "\nalpha : ", alpha0)
txt
}<- paste0(txt, "\nSample size per group : ", N / 2,
txt "\nTotal sample size : ", n,
"\nAchieved power : ", signif(pwr, 5),
sprintf("\nAll comparisons with %.2f%% CI\n",
100 * (1 - 2 * alpha)))
cat(txt)
# Tests : 2
# Groups : 3
# alpha : 0.05
# Sample size per group : 65
# Total sample size : 195
# Achieved power : 0.80351
# All comparisons with 90.00% CI
Compare that to case b, where we want to show bioequivalence for any of the test treatments to the reference treatment:
<- 0.05 # nominal (overall) level
alpha0 <- 0.40 # assumed total CV
CV <- 0.95 # assumed T/R-ratio(s)
theta0 <- 0.80 # target (desired) power
target <- 2 # number of tests
k # Note that a formal sample size estimation is not required in a pilot study;
# only in a pivotal study!
<- TRUE # any of the tests should pass
adj if (!adj) { # unadjusted
<- alpha0
alpha else { # Bonferroni
} <- alpha0 / k
alpha
}# Total sample size and power for two groups
<- sampleN.TOST(alpha = alpha,
tmp CV = CV,
theta0 = theta0,
targetpower = target,
design = "parallel",
print = FALSE)
<- tmp[["Sample size"]]
N <- tmp[["Achieved power"]]
pwr # Total sample size for k + 1 groups
<- N * (k + 1) / 2
n <- paste0("Tests : ", k,
txt "\nGroups : ", k + 1)
if (adj) {
<- paste0(txt, "\nalpha : ", alpha0,
txt "\nMultiplicity adjustment : Bonferroni",
"\nAdjusted alpha : ", signif(alpha, 6))
else {
} <- paste0(txt, "\nalpha : ", alpha0)
txt
}<- paste0(txt, "\nSample size per group : ", N / 2,
txt "\nTotal sample size : ", n,
"\nAchieved power : ", signif(pwr, 5),
sprintf("\nAll comparisons with %.2f%% CI\n",
100 * (1 - 2 * alpha)))
cat(txt)
# Tests : 2
# Groups : 3
# alpha : 0.05
# Multiplicity adjustment : Bonferroni
# Adjusted alpha : 0.025
# Sample size per group : 81
# Total sample size : 243
# Achieved power : 0.80014
# All comparisons with 95.00% CI
The sample size increases by ~25% from 195 (65 subjects per group)
required in the unadjusted case, where both tests would have to pass
with \(\small{\alpha\;0.05}\).
IMHO, due to a large sample
size penalty including more than two tests is questionable. With three
test (\(\small{\alpha\;0.0167}\)) we
would require 364 subjects and with four (\(\small{\alpha\;0.0125}\)) 490. Instead, I
suggest to perform a – reasonably large – pilot study in order to select
between ‘candidates’ and perform the pivotal study with only this
treatment.
© CC-BY-SA 2.0 ICMA Photos
<- 1 / 2 # Probability to win a single toss
prob <- data.frame(tosses = 1L:4L, success = NA_real_)
res for (i in 1:nrow(res)) {
$success[i] <- pbinom(i - 1, i, prob)
res
}names(res) <- c("toss(es)", "p (win)")
print(res, row.names = FALSE, right = FALSE)
# toss(es) p (win)
# 1 0.5000
# 2 0.7500
# 3 0.8750
# 4 0.9375
That’s trivial: The more often you try, the higher the chance to win at least one of the bets.
As the number of comparisons increases, the Bonferroni-adjustment becomes more conservative. Another issue is that the tests are not strictly independent (at least when we compare different treatments to the same reference treatment some – unknown – correlation exists).
The ICH’s draft guideline^{22} states:
“
- If the objective is to achieve BE for all test formulations versus the comparator product, then no alpha adjustment is needed.
- If the objective is to show BE for any of the test formulations, then multiplicity adjustment may be needed.
Comments:
So far we employed the common (and hence, default)
BE-limits theta1 = 0.80
and theta2 = 1.25
. In some jurisdictions tighter limits of
90.00 – 111.11% have to be used.^{28} What if we have to deal with a
NTID in oncology (a
study in patients)?
You have to provide only the lower
BE-limit theta1
(the
upper one will be automatically calculated).
sampleN.TOST(CV = 0.40, design = "parallel",
theta1 = 0.90)
#
# +++++++++++ Equivalence test - TOST +++++++++++
# Sample size estimation
# -----------------------------------------------
# Study design: 2 parallel groups
# log-transformed data (multiplicative model)
#
# alpha = 0.05, target power = 0.8
# BE margins = 0.9 ... 1.111111
# True ratio = 0.95, CV = 0.4
#
# Sample size (total)
# n power
# 1258 0.800289
Terrible.
Interlude: Health Canada
Health Canada requires for NTIDs (termed by HC ‘critical dose drugs’) that the confidence interval of C_{max} lies within 80.0 – 125.0% and the one of AUC within 90.0 – 112.0%.
<nitpick></nitpick>
Hence, for Health Canada you have to set both limits.
sampleN.TOST(CV = 0.40, design = "parallel",
theta1 = 0.90, theta2 = 1.12)
#
# +++++++++++ Equivalence test - TOST +++++++++++
# Sample size estimation
# -----------------------------------------------
# Study design: 2 parallel groups
# log-transformed data (multiplicative model)
#
# alpha = 0.05, target power = 0.8
# BE margins = 0.9 ... 1.12
# True ratio = 0.95, CV = 0.4
#
# Sample size (total)
# n power
# 1258 0.800289
Here we get the same sample size than with the common limits for NTIDs.
It’s extremely unlikely to require fewer subjects than with exact limits:
<- 0.80
target <- 0.95
theta0 <- 0.20
CV.min <- 0.80
CV.max <- 2500L
CV.step <- seq(CV.min, CV.max, length.out = CV.step)
CV <- data.frame(CV = CV, n.EMA = NA_integer_,
res n.HC = NA_integer_, less = FALSE)
for (i in 1:nrow(res)) {
$n.EMA[i] <- sampleN.TOST(CV = CV[i], theta0 = theta0,
restargetpower = target, design = "parallel",
theta1 = 0.90, theta2 = 1/0.90,
print = FALSE)[["Sample size"]]
$n.HC[i] <- sampleN.TOST(CV = CV[i], theta0 = theta0,
restargetpower = target, design = "parallel",
theta1 = 0.90, theta2 = 1.12,
print = FALSE)[["Sample size"]]
if (res$n.HC[i] < res$n.EMA[i]) res$less[i] = TRUE
}cat("target power :", target,
"\ntheta0 :", theta0,
"\nCV :", CV.min, "\u2013", CV.max,
"\nLower sample size for HC :",
sprintf("%.4f%% of %i cases.\n", 100 * sum(res$less) / CV.step, CV.step))
# target power : 0.8
# theta0 : 0.95
# CV : 0.2 – 0.8
# Lower sample size for HC : 0.0000% of 2500 cases.
The FDA requires for NTIDs a tighter batch-release specification (±5% instead of ±10%). Let’s hope that your product complies and the T/R-ratio will be closer to 1:
sampleN.TOST(CV = 0.40, design = "parallel",
theta0 = 0.975, # ‘better’ T/R-ratio
theta1 = 0.90)
#
# +++++++++++ Equivalence test - TOST +++++++++++
# Sample size estimation
# -----------------------------------------------
# Study design: 2 parallel groups
# log-transformed data (multiplicative model)
#
# alpha = 0.05, target power = 0.8
# BE margins = 0.9 ... 1.111111
# True ratio = 0.975, CV = 0.4
#
# Sample size (total)
# n power
# 588 0.801308
Substantially lower sample size but still not nice.
Luckily most NTIDs do not show a large variability (specifically within subjects). However, this does not help much because in a parallel design the total CV consists of both within and between subject components. Let’s be optimistic and assume a lower CV.
sampleN.TOST(CV = 0.30, design = "parallel",
theta0 = 0.975, theta1 = 0.90)
#
# +++++++++++ Equivalence test - TOST +++++++++++
# Sample size estimation
# -----------------------------------------------
# Study design: 2 parallel groups
# log-transformed data (multiplicative model)
#
# alpha = 0.05, target power = 0.8
# BE margins = 0.9 ... 1.111111
# True ratio = 0.975, CV = 0.3
#
# Sample size (total)
# n power
# 342 0.801324
Well… Without a reasonably large pilot study (where we estimate the CV) such an assumption would be pretty risky – it may fire back. What if we performed the study in 342 patients and our assumption was too optimistic (say, the CV turned out to be 0.35)?
power.TOST(CV = 0.35, design = "parallel",
theta0 = 0.975, theta1 = 0.90,
n = 342)
# [1] 0.6727196
Hardly better than betting ‘two dozen’ or ‘two columns’ in Roulette \(\left(\small{\frac{2\,\times\,12}{1\,+\,36}\sim0.65}\right)\). I hope you don’t want to go there…
Sometimes we are interested in assessing differences of responses and
not their ratios. In such a case we have to set
logscale = FALSE
. The limits theta1
and
theta2
can be expressed in the following ways:
Note that in the former case the units of CV
, and
theta0
need also to be given relative to the reference mean
(specified as ratio).
Let’s estimate the sample size for an equivalence trial of two blood pressure lowering drugs assessing the difference in means of untransformed data (raw, linear scale). In this setup everything has to be given with the same units (i.e., here the assumed difference –5 mm Hg and the lower / upper limits ∓15 mm Hg systolic blood pressure). Furthermore, we assume a CV of 25 mm Hg.
sampleN.TOST(CV = 20, theta0 = -5,
theta1 = -15, theta2 = +15,
logscale = FALSE, design = "parallel")
#
# +++++++++++ Equivalence test - TOST +++++++++++
# Sample size estimation
# -----------------------------------------------
# Study design: 2 parallel groups
# untransformed data (additive model)
#
# alpha = 0.05, target power = 0.8
# BE margins = -15 ... 15
# True diff. = -5, CV = 20
#
# Sample size (total)
# n power
# 102 0.805523
Sometimes in the literature we find not the CV but the standard deviation of the difference SD. Say, it is given with 36 mm Hg. We have to convert it to a CV. In a parallel design CV = SD / 2.
<- 36
SD sampleN.TOST(CV = SD / 2, theta0 = -5,
theta1 = -15, theta2 = +15,
logscale = FALSE, design = "parallel")
#
# +++++++++++ Equivalence test - TOST +++++++++++
# Sample size estimation
# -----------------------------------------------
# Study design: 2 parallel groups
# untransformed data (additive model)
#
# alpha = 0.05, target power = 0.8
# BE margins = -15 ... 15
# True diff. = -5, CV = 18
#
# Sample size (total)
# n power
# 82 0.801691
Note that other software packages (e.g., PASS, nQuery, StudySize, …) require the standard deviation of the difference as input.
It would be a fundamental flaw if responses are assumed to a follow a normal distribution and not – like above – assess their difference and at the end instead giving an estimate of the difference, calculate the ratio of the test- and reference-means.
In such a case Fieller’s (‘fiducial’) confidence interval^{29} has to
be employed.
Note that alpha = 0.025
is the default in the functions
sampleN.RatioF()
and power.RatioF()
, since it
is intended for studies with clinical endpoints, where the 95%
confidence interval is usually used for equivalence testing.^{30}
sampleN.RatioF(CV = 0.40, theta0 = 0.95,
design = "parallel")
#
# +++++++++++ Equivalence test - TOST +++++++++++
# based on Fieller's confidence interval
# Sample size estimation
# -----------------------------------------------
# Study design: 2-group parallel
# Ratio of means with normality on original scale
# alpha = 0.025, target power = 0.8
# BE margins = 0.8 ... 1.25
# True ratio = 0.95, CV = 0.4
#
# Sample size
# n power
# 188 0.801068
“He who seeks for methods without having a definite problem in mind seeks in the most part in vain.
PowerTOST
the default method = "exact"
implements Owen’s Q function^{31} which is also used in SAS’
Proc Power
.
Proc Power; twosamplemeans test = equiv_ratio power = . alpha = 0.05 meanratio = 1.0 npergroup = 450 cv = 1.5 lower = 0.80 upper = 1.25 ; run;
If you run the macro in SAS (at least v9.1 is required), you should get 0.84896.^{32}
Other implemented methods are "mvt"
(based on the
bivariate noncentral t-distribution),
"noncentral"
/ "nct"
(based on the noncentral
t-distribution), and "shifted"
/
"central"
(based on the shifted central
t-distribution). Although "mvt"
is also exact,
it may have a somewhat lower precision compared to Owen’s Q and has a
substantially longer run-time.
Power approximated by the shifted central t-distribution is generally slightly lower compared to the others. Hence, if used in sample size estimations, occasionally two more subjects are ‘required’ – which is not correct as demonstrated by the exact method.
<- c("exact", "mvt", "noncentral", "central")
methods <- data.frame(method = methods)
x for (i in 1:nrow(x)) {
<- sampleN.TOST(CV = 0.40, theta0 = 0.90,
tmp design = "parallel",
method = x$method[i],
targetpower = 0.80, print = FALSE)
$n[i] <- tmp[["Sample size"]]
x$power[i] <- tmp[["Achieved power"]]
x$power.exact[i] <- power.TOST(CV = 0.40,
xtheta0 = 0.90,
design = "parallel",
n = x$n[i],
method = "exact")
}print(x, row.names = FALSE)
# method n power power.exact
# exact 266 0.8000762 0.8000762
# mvt 266 0.8000775 0.8000762
# noncentral 266 0.8000762 0.8000762
# central 268 0.8024811 0.8026911
Particularly for ‘nice’ combinations of the CV and T/R-ratio the relative increase in sample sizes (and hence, study costs) might be relevant.
set.seed(123456)
<- 1e4
nsims <- signif(runif(nsims, min = 0.25, max = 0.60), 2)
CV <- signif(runif(nsims, min = 0.85, max = 0.95), 2)
theta0 <- data.frame(CV = CV, theta0 = theta0,
x exact = NA_real_, shifted = NA_real_,
diff = NA_real_, delta.n = FALSE)
<- unique(x)
x for (i in 1:nrow(x)) {
$exact[i] <- sampleN.TOST(CV = x$CV[i],
xtheta0 = x$theta0[i],
design = "parallel",
method = "exact",
print = FALSE)[["Sample size"]]
$shifted[i] <- sampleN.TOST(CV = x$CV[i],
xtheta0 = x$theta0[i],
design = "parallel",
method = "shifted",
print = FALSE)[["Sample size"]]
if (x$shifted[i] != x$exact[i]) x$delta.n[i] <- TRUE
}$diff <- 100 * (x$shifted - x$exact) / x$exact
x<- x[with(x, order(-diff, theta0, CV)), ]
x $diff <- sprintf("%+.3f%%", x$diff)
xprint(x[which(x$delta.n == TRUE), 1:5], row.names = FALSE)
# CV theta0 exact shifted diff
# 0.27 0.95 62 64 +3.226%
# 0.28 0.94 74 76 +2.703%
# 0.28 0.93 84 86 +2.381%
# 0.32 0.93 108 110 +1.852%
# 0.28 0.91 114 116 +1.754%
# 0.26 0.90 118 120 +1.695%
# 0.28 0.90 136 138 +1.471%
# 0.34 0.92 140 142 +1.429%
# 0.45 0.95 160 162 +1.250%
# 0.37 0.92 164 166 +1.220%
# 0.51 0.95 200 202 +1.000%
# 0.53 0.95 214 216 +0.935%
# 0.48 0.93 228 230 +0.877%
# 0.30 0.88 236 238 +0.847%
# 0.51 0.93 254 256 +0.787%
# 0.59 0.95 258 260 +0.775%
# 0.48 0.92 264 266 +0.758%
# 0.40 0.90 266 268 +0.752%
# 0.53 0.93 272 274 +0.735%
# 0.49 0.92 274 276 +0.730%
# 0.50 0.92 284 286 +0.704%
# 0.35 0.88 316 318 +0.633%
# 0.49 0.91 322 324 +0.621%
# 0.31 0.87 324 326 +0.617%
# 0.50 0.91 334 336 +0.599%
# 0.52 0.91 358 360 +0.559%
# 0.34 0.87 386 388 +0.518%
# 0.46 0.88 524 526 +0.382%
# 0.46 0.87 676 678 +0.296%
# 0.54 0.88 698 700 +0.287%
# 0.56 0.88 744 746 +0.269%
# 0.51 0.87 814 816 +0.246%
# 0.53 0.86 1172 1174 +0.171%
Therefore, I recommend to use "method = shifted"
/
"method = central"
only for comparing with old results
(literature, own studies).
You can compare the results of sampleN.TOST()
to
simulations via the ‘key’ statistics (\(\small{\sigma^2}\) follows a \(\small{\chi^2\text{-}}\)distribution with
\(\small{n-2}\) degrees of freedom and
\(\small{\log_{e}(\theta_0)}\) follows
a normal distribution^{33}), which are provided by the function
power.TOST.sim()
.
In the script 100,000 studies are assessed for BE (where empiric power
is the number of passing studies divided by the number of simulations).
Two sample sizes are tried first: n = 12 and n = 288.
Depending on power, the sample size is increased or decreased until at
least the target power is obtained.
<- function(alpha, CV, theta0, theta1, theta2,
sampleN.key info = FALSE) {
targetpower, # sample size by simulating via the ‘key’ statistics
<- function(x) {
opt suppressWarnings(
suppressMessages(
power.TOST.sim(alpha = alpha, CV = CV, theta0 = theta0,
theta1 = theta1, theta2 = theta2,
design = "parallel", nsims = nsims,
n = x))) - targetpower
}<- 1e5L
nsims # if you get a substantial difference, increase the number of simulations
<- proc.time()[[3]]
t <- uniroot(opt, interval = c(12L, 288L),
x extendInt = "upX", tol = 0.2)
if (info) {
message(signif(proc.time()[[3]] - t, 3), " seconds for ", x$iter,
" iterations of ", prettyNum(nsims, format = "i", big.mark = ","),
" simulated studies.\n")
}<- x$root
n <- as.integer(2 * (n %/% 2 + as.logical(n %% 2)))
n <- data.frame(alpha = alpha, CV = CV, theta0 = theta0,
res theta1 = theta1, theta2 = theta2,
n = n)
return(res)
}<- function(alpha, CV, theta0, theta1,
internal.validation
theta2, targetpower,info = FALSE) {
if (missing(alpha)) alpha <- 0.05
if (missing(CV)) stop("CV must be given!")
if (missing(theta0)) theta0 <- 0.95
if (missing(theta1) & missing(theta2)) theta1 <- 0.80
if (missing(theta1)) theta1 = 1/theta2
if (missing(theta2)) theta2 = 1/theta1
if (missing(targetpower)) targetpower <- 0.80
<- data.frame(Approach = "key",
simulated sampleN.key(alpha = alpha, CV = CV, theta0 = theta0,
theta1 = theta1, theta2 = theta2,
targetpower = targetpower,
info = info)[c(2:6)])
<- data.frame(Approach = "exact",
exact sampleN.TOST(alpha = alpha, CV = CV, theta0 = theta0,
theta1 = theta1, theta2 = theta2,
targetpower = targetpower,
design = "parallel", print = FALSE,
details = FALSE)[c(3:7)])
names(exact)[6] <- "n"
<- rbind(simulated, exact)
comp <- cbind(comp, equal = "")
comp $Approach[1] <- "1: \u2018Key\u2019 statistics"
comp$Approach[2] <- "2: Exact (Owen\u2019s Q)"
compnames(comp)[1] <- "Sample size estimation"
if (comp$n[2] == comp$n[1]) {
$equal[2] <- " yes"
compnames(comp)[7] <- "equal?"
else {
}$equal[2] <- sprintf("%+3.0f", comp$n[2] - comp$n[1])
compnames(comp)[7] <- "diff."
}3] <- sprintf("%.4f", comp[, 3])
comp[, 4] <- sprintf("%.4f", comp[, 4])
comp[, 5] <- sprintf("%.4f", comp[, 5])
comp[, print(comp, row.names = FALSE, right = FALSE)
}
Example for CV 0.40 (using all defaults).
internal.validation(CV = 0.40) # info = TRUE for more information
# Sample size estimation CV theta0 theta1 theta2 n equal?
# 1: ‘Key’ statistics 0.4 0.9500 0.8000 1.2500 130
# 2: Exact (Owen’s Q) 0.4 0.9500 0.8000 1.2500 130 yes
The package contains sample data frames from the literature^{34} (show
them by ctSJ.VIII.10
and ctSJ.VIII.20
).
Execute the script test_parallel.R
which can be found in
the /tests
sub-directory of the package to reproduce them.
If results don’t agree, something is wrong with your installation. Do
you use a version <0.9.3 of 2012-02-13?
If yes, update to the current one.
You can also reproduce Table 7.6.^{10} Note that it gives sample sizes for one
arm and therefore, we have to divide the total sample
sizes obtained by sampleN.TOST()
by two. Julious did not
report extreme sample sizes (>3,623) and therefore, I dropped them
from the output as well.
<- paste("True Mean Ratios 0.80\u20131.20",
txt "\n90% power by the noncentral t-distribution",
"\n Levels of Bioequivalence",
"\n 10% -> 90.00\u2013111.11%",
"\n 15% -> 85.00\u2013117.65%",
"\n 20% -> 80.00\u2013125.00%",
"\n 25% -> 75.00\u2013133.33%",
"\n 30% -> 70.00\u2013142.86%",
"\nSample sizes (per arm)\n")
<- function(expr) {
tryCatch.W.E ##' Catch *and* save both errors and warnings,
##' and in the case of a warning, also keep the
##' computed result.
##'
##' @title tryCatch both warnings (with value) and
##' errors
##' @param expr an \R expression to evaluate
##' @return a list with 'value' and 'warning', where
##' 'value' may be an error caught.
##' @author Martin Maechler;
##' Copyright (C) 2010-2012 The R Core Team
<- NULL
W <- function(w) {# warning handler
w.handler <<- w
W invokeRestart("muffleWarning")
}list(value = withCallingHandlers(
tryCatch(expr, error = function(e) e),
warning = w.handler),
warning = W)
}<- function(CV, theta0, theta1) {
fun return(
sampleN.TOST(CV = CV, theta0 = theta0,
theta1 = theta1,
targetpower = 0.90,
design = "parallel",
method = "noncentral",
print = FALSE)[["Sample size"]])
}<- seq(0.30, 0.65, 0.05)
CV <- seq(0.80, 1.20, 0.05)
ratio <- seq(0.90, 0.70, -0.05)
theta1 <- data.frame(CV = rep(CV, each = length(ratio)),
x Ratio = ratio, n10 = NA, n15 = NA,
n20 = NA, n25 = NA, n30 = NA)
for (i in 1:nrow(x)) {
for (j in 3:ncol(x)) {
# needed to continue if ratio at or outside limits
# (otherwise, stops with error)
<- tryCatch.W.E(fun(CV = x$CV[i],
y theta0 = x$Ratio[i],
theta1 = theta1[j-2]))
if (is.numeric(y$value)) x[i, j] <- y$value
}
}3:7] <- x[, 3:7] / 2
x[, > 3623] <- NA_integer_
x[x for (j in 3:ncol(x)) {
<- formatC(x[, j], format = "f",
x[, j] digits = 0, big.mark=",")
}== "NA")] <- ""
x[(x $CV <- x$CV*100
xnames(x)[c(1, 3:7)] <- c("CV (%)",
paste0(seq(10, 30, 5), "%"))
cat(txt); print(x, row.names = FALSE)
# True Mean Ratios 0.80–1.20
# 90% power by the noncentral t-distribution
# Levels of Bioequivalence
# 10% -> 90.00–111.11%
# 15% -> 85.00–117.65%
# 20% -> 80.00–125.00%
# 25% -> 75.00–133.33%
# 30% -> 70.00–142.86%
# Sample sizes (per arm)
# CV (%) Ratio 10% 15% 20% 25% 30%
# 30 0.80 356 84
# 30 0.85 403 95 40
# 30 0.90 453 108 46 25
# 30 0.95 506 121 51 28 18
# 30 1.00 169 72 39 24 16
# 30 1.05 462 115 50 28 17
# 30 1.10 328 92 41 23
# 30 1.15 2,851 213 69 33
# 30 1.20 887 134 50
# 35 0.80 476 112
# 35 0.85 540 128 54
# 35 0.90 607 144 61 33
# 35 0.95 678 161 69 37 23
# 35 1.00 226 96 51 31 21
# 35 1.05 620 154 67 37 23
# 35 1.10 439 122 55 30
# 35 1.15 286 92 43
# 35 1.20 1,189 179 66
# 40 0.80 611 144
# 40 0.85 693 163 69
# 40 0.90 779 184 78 41
# 40 0.95 871 207 88 48 30
# 40 1.00 291 123 66 40 26
# 40 1.05 796 198 85 47 29
# 40 1.10 564 157 70 38
# 40 1.15 367 117 55
# 40 1.20 1,527 230 85
# 45 0.80 759 178
# 45 0.85 861 203 85
# 45 0.90 968 229 96 51
# 45 0.95 1,082 257 109 59 36
# 45 1.00 361 152 81 49 33
# 45 1.05 988 245 106 58 36
# 45 1.10 700 194 87 47
# 45 1.15 455 146 68
# 45 1.20 1,896 286 105
# 50 0.80 919 216
# 50 0.85 1,041 245 103
# 50 0.90 1,171 277 116 62
# 50 0.95 1,309 310 131 71 44
# 50 1.00 436 184 98 60 39
# 50 1.05 1,195 297 128 70 43
# 50 1.10 847 235 104 57
# 50 1.15 551 176 82
# 50 1.20 2,295 345 127
# 55 0.80 1,088 255
# 55 0.85 1,233 290 121
# 55 0.90 1,387 327 137 73
# 55 0.95 1,550 367 155 84 52
# 55 1.00 516 218 116 70 46
# 55 1.05 1,416 351 151 82 51
# 55 1.10 1,003 278 124 68
# 55 1.15 652 208 97
# 55 1.20 2,718 409 150
# 60 0.80 1,266 297
# 60 0.85 1,434 337 141
# 60 0.90 1,613 381 160 85
# 60 0.95 1,803 427 180 97 60
# 60 1.00 601 253 135 82 54
# 60 1.05 1,647 408 176 96 59
# 60 1.10 1,167 323 144 78
# 60 1.15 759 242 113
# 60 1.20 3,162 476 174
# 65 0.80 1,450 340
# 65 0.85 1,643 386 161
# 65 0.90 1,849 436 183 97
# 65 0.95 2,066 489 207 111 68
# 65 1.00 688 290 154 93 61
# 65 1.05 1,887 468 201 109 68
# 65 1.10 1,337 371 164 90
# 65 1.15 869 277 129
# 65 1.20 3,623 545 200
A comparison of methods implemented in power.TOST()
with
SAS’ Proc Power
for various equally sized groups,
CV 1.5, and T/R-ratio 1.
<- n2 <- c(50, 150, 250, 350, 450, 550)
n1 <- c("Proc Power",
methods " exact", " mvt", " noncentral", " central")
<- data.frame(n1 = rep(n1, each = length(methods)),
x n2 = rep(n2, each = length(methods)),
method = rep(methods, length(n)),
power = c(c(0, rep(NA_real_, 4)),
c(0.10488, rep(NA_real_, 4)),
c(0.48431, rep(NA_real_, 4)),
c(0.71606, rep(NA_real_, 4)),
c(0.84896, rep(NA_real_, 4)),
c(0.92185, rep(NA_real_, 4))),
match = "")
for (i in 1:nrow(x)) {
if (!x$method[i] == "Proc Power") {
<- trimws(x$method[i], which = "left")
method $power[i] <- round(
xpower.TOST(CV = 1.5,
theta0 = 1.0,
n = c(x$n1[i], x$n2[i]),
design = "parallel",
method = method), 5)
}if (method == "exact") {
if (all.equal(x$power[i], x$power[i-1])) x$match[i] <- TRUE
}
}print(x, row.names = FALSE, right = FALSE)
# n1 n2 method power match
# 50 50 Proc Power 0.00000
# 50 50 exact 0.00000 TRUE
# 50 50 mvt 0.00000
# 50 50 noncentral 0.00000
# 50 50 central 0.00000
# 150 150 Proc Power 0.10488
# 150 150 exact 0.10488 TRUE
# 150 150 mvt 0.10488
# 150 150 noncentral 0.10431
# 150 150 central 0.10336
# 250 250 Proc Power 0.48431
# 250 250 exact 0.48431 TRUE
# 250 250 mvt 0.48431
# 250 250 noncentral 0.48431
# 250 250 central 0.48405
# 350 350 Proc Power 0.71606
# 350 350 exact 0.71606 TRUE
# 350 350 mvt 0.71606
# 350 350 noncentral 0.71606
# 350 350 central 0.71589
# 450 450 Proc Power 0.84896
# 450 450 exact 0.84896 TRUE
# 450 450 mvt 0.84897
# 450 450 noncentral 0.84896
# 450 450 central 0.84879
# 550 550 Proc Power 0.92185
# 550 550 exact 0.92185 TRUE
# 550 550 mvt 0.92185
# 550 550 noncentral 0.92185
# 550 550 central 0.92169
Q: Can we use R in a
regulated environment and is PowerTOST
validated?
A: See this document^{35} about the
acceptability of Base R
and its
SDLC.^{36}
R
is updated every couple of
months with documented changes^{37} and maintaining a bug-tracking system.^{38} I
recommend to use always the latest release.
The authors of PowerTOST
have tried their best to provide
reliable and valid results (see also the section Validation above). The package’s NEWS
documents the development of the package, bug-fixes, and introduction of
new methods. Issues are tracked at GitHub (as of
today none is still open). So far the package had >102,000 downloads.
Therefore, it is extremely unlikely that bugs were not detected given
its large user base.
The ultimate responsibility of validating any software (yes, of
SAS as well…) lies in the hands of the user.^{39} ^{40} ^{41} ^{42} As an aside,
statisticians of the
FDA regularly use
R
themselves…
Q: Shall we throw away our sample size
tables?
A: Not at all. File them in your archives to collect
dust. Maybe in the future you will be asked by an agency how you arrived
at a sample size. But: Don’t use them any more. What you should not do
(and hopefully haven’t done before): Interpolate. Power and therefore,
the sample size depends in a highly nonlinear fashion on the five
conditions listed above, which makes
interpolation of values given in tables a nontrivial job.
Q: Which of the methods should we use in our
daily practice?
A: method = exact
. Full stop. Why rely on
approximations? Since it is the default in sampleN.TOST()
and power.TOST()
, you don’t have to give this argument
(saves keystrokes).
Q: I fail to understand your example about
dropouts. We finish the study with 130 eligible subjects as desired. Why
is the dropout-rate ~11% and not the anticipated 10%?
A: That’s due to rounding up the calculated adjusted
sample size (144.44…) to the next even number (146).
If you manage it to dose a fractional subject (I can’t), your dropout
rate would indeed be the anticipated one:
100 × (1 – 130/144.44…) = 10%. ⬜
Q: Do we have to worry about unequal group
sizes?
A: sampleN.TOST()
will always give the
total number of subjects for equally sized groups. If
you want to design a study with unequal group sizes, see above.
If you are interested in post hoc power, give the sample size
as a vector, i.e.,
power.TOST(..., n = c(foo, bar))
, where foo
and bar
are the sizes of groups.
For the evaluation of a study with unequal group sizes see the
next Q&A (though that’s not related to
PowerTOST
).
Q: Do we have to worry about unequal variances
(heteroscedasticty)?
A: Not in planning the study. Esp. in parallel designs
you will always try to keep the characteristics of groups
(in- / exclusion criteria, demographics, metabolic
status, …) as similar as possible. It would be a bad idea to have
ballerinas in one group and Sumo-wrestlers in the other. However, at the
end of the study characteristics might differ indeed. The conventional
t-test is sensitive to both unequal group sizes and – to a
lesser extent – unequal variances.
What you should not do in the evaluation: Run any pre-test (F-test,
Levene’s
test, Bartlett’s
test, Brown–Forsythe
test) because it will inflate the Type I Error.^{43} Use the Welch-Satterthwaite
test^{44} instead, which – not by any chance – is
the default in R, SAS, and other software
packages.
Since you read that far, I’m confident that you don’t use Excel. Even in
its current version T.TEST(,,,3)
(where 3
specifies heteroscedasticity) rounds the degrees of freedom
down to the next integer – which is wrong – they are
real numbers! Hence, the confidence interval will be wider than
necessary.
Q: Is it possible to simulate power of
studies?
A: That’s not necessary, since the methods in
power.TOST()
provide analytical (exact) solutions.
However, if you don’t trust them, simulations are possible with the
function power.TOST.sim()
, which employs the distributional
properties.
Convergence takes a while. Empiric power, its standard error, its
relative error compared to exact power, and execution times on a a Xeon
E3-1245v3 @ 3.40GHz (1/4 cores) 16GB RAM with R
4.3.1 on Windows 7 build 7601,
Service Pack 1, Universal C Runtime 10.0.10240.16390:
<- 0.40
CV <- "parallel"
design <- sampleN.TOST(CV = CV, design = design,
n print = FALSE)[["Sample size"]]
<- c(1e5L, 5e5L, 1e6L, 5e6L, 1e7L, 5e7L, 1e8L)
nsims <- signif(power.TOST(CV = CV, n = n,
exact design = design), 5)
<- data.frame(simulations = nsims,
x exact = rep(exact, length(nsims)),
simulated = NA_real_, SE = NA_real_,
RE = NA_real_)
for (i in 1:nrow(x)) {
<- proc.time()[[3]]
start.time $simulated[i] <- power.TOST.sim(CV = CV, n = n,
xdesign = design,
nsims = nsims[i])
$secs[i] <- proc.time()[[3]] - start.time
x$SE[i] <- sqrt(0.5 * x$simulated[i] / nsims[i])
x$RE <- 100 * (x$simulated - exact) / exact
x
}$exact <- signif(x$exact, 5)
x$SE <- formatC(x$SE, format = "f", digits = 5)
x$RE <- sprintf("%+.5f%%", x$RE)
x$simulated <- signif(x$simulated, 5)
x$simulations <- prettyNum(x$simulations, format = "i", big.mark = ",")
xnames(x)[c(1, 3)] <- c("sim\u2019s", "sim\u2019d")
print(x, row.names = FALSE)
# sim’s exact sim’d SE RE secs
# 100,000 0.80351 0.80286 0.00200 -0.08090% 0.09
# 500,000 0.80351 0.80263 0.00090 -0.10952% 0.52
# 1,000,000 0.80351 0.80318 0.00063 -0.04057% 1.15
# 5,000,000 0.80351 0.80369 0.00028 +0.02243% 5.59
# 10,000,000 0.80351 0.80353 0.00020 +0.00192% 10.42
# 50,000,000 0.80351 0.80351 0.00009 -0.00050% 50.93
# 100,000,000 0.80351 0.80351 0.00006 +0.00044% 100.25
With an increasing number of simulations values become more precise (the standard error decreases) as well as more accurate (the relative error decreases).
In the Monte Carlo method values approach the exact one asymptotically.
top of section ↩︎ previous section ↩︎
David Dubins (Leslie Dan Faculty of Pharmacy, University of Toronto) for helpful comments and Anže Kastelic (Krka, Novo mesto, Slovenia) for discovering a bug in my scripts calculating \(\small{CV_\text{total}}\).
Helmut Schütz 2023
R
and PowerTOST
GPL 3.0,
klippy
MIT,
pandoc
GPL 2.0.
1^{st} version February 26, 2021. Rendered October 1, 2023 09:00
CEST by rmarkdown via pandoc in 1.08 seconds.
Labes D, Schütz H, Lang B. PowerTOST: Power and Sample Size for (Bio)Equivalence Studies. Package version 1.5.5. 2023-06-27. CRAN.↩︎
Labes D, Schütz H, Lang B. Package ‘PowerTOST’. February 21, 2022. CRAN.↩︎
Senn S. Guernsey McPearson’s Drug Development Dictionary. April 21, 2020. Online.↩︎
Hoenig JM, Heisey DM. The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. Am Stat. 2001; 55(1): 19–24. doi:10.1198/000313001300339897. Open Access.↩︎
Even if we have both variabilities from a crossover study, we have to pool them for the sample size estimation of a parallel design.↩︎
Midha KK, Ormsby ED, Hubbard JW, McKay G, Hawes EM, Gavalas L, McGilveray IJ. Logarithmic Transformation in Bioequivalence: Application with Two Formulations of Perphenazine. J Pharm Sci. 1993; 82(2): 138–44. doi:10.1002/jps.2600820205.↩︎
Hauschke D, Steinijans VW, Diletti E, Schall R, Luus HG, Melze M, Blume H. Presentation of the intrasubject coefficient of variation for sample size planning in bioequivalence studies. Int J Clin Pharmacol Ther. 1994; 32(7): 376-8. PMID:7952801.↩︎
Senn S. Statistical Issues in Drug Development. Chichester: John Wiley; 2^{nd} ed 2007.↩︎
Julious SA. Sample Sizes for Clinical Trials. Boca Raton: CRC Press; 2010. Chapter 7.3.↩︎
Common batch release specifications are ±10% of the declared content. Of course, you will try to find a batch of the reference product which is as close as possible to the test (the EMA recommends ≤5%). However, keep analytical (in)accuracy and (im)precision in mind. Furthermore, the analytical method was only validated for the test. Try to get a CoA from the originator… Good luck. You never can be sure.↩︎
Senn S. Statistical Issues in Drug Development. Chichester: Wiley; 2004.↩︎
Zhang P. A Simple Formula for Sample Size Calculation in Equivalence Studies. J Biopharm Stat. 2003; 13(3): 529–38. doi:10.1081/BIP-120022772.↩︎
Doyle AC. The Adventures of Sherlock Holmes. A Scandal in Bohemia. 1892. p. 3.↩︎
Schütz H. Sample Size Estimation in Bioequivalence. Evaluation. 2020-10-23. BEBA Forum.↩︎
Lenth RV. Two Sample-Size Practices that I Don’t Recommend. October 24, 2000. Online.↩︎
Quoting my late father: »If you believe, go to church.«↩︎
Ring A, Lang B, Kazaroho C, Labes D, Schall R, Schütz H. Sample size determination in bioequivalence studies using statistical assurance. Br J Clin Pharmacol. 2019; 85(10): 2369–77. doi:10.1111/bcp.14055. Open Access.↩︎
Berger RL, Hsu JC. Bioequivalence Trials, Intersection-Union Tests and Equivalence Confidence Sets. Stat Sci. 1996; 11(4): 283–302. JSTOR:2246021.↩︎
Zeng A. The TOST confidence intervals and the coverage probabilities with R simulation. March 14, 2014. Online.↩︎
With alpha = 0.5
the test effectively
reduces to a dichotomous (pass|fail) assessment.↩︎
ICH. Bioequivalence for Immediate Release Solid Oral Dosage Forms. M13A – Draft. Section 2.2.5.2 Multiple Test Products. 20 December 2022. Online.↩︎
Medicines for Europe. 2^{nd} Bioequivalence Workshop. Session 2 – ICH M13 – Bioequivalence for IR solid oral dosage forms. Brussels. 26 April 2023.↩︎
Since different patient populations are concerned, the Type I Error is not affected.↩︎
FDA, CDER. Draft Guidance for Industry. Statistical Approaches to Establishing Bioequivalence. Silver Spring. December 2022. Revision 1. Download.↩︎
FDA (CDER, OGD). Navigating the First ICH Generic Drug Draft Guideline “M13A Bioequivalence for Immediate-Release Solid Oral Dosage Forms”. SBIA Webinar. May 2, 2023. Online.↩︎
EMA, CHMP. Guideline on multiplicity issues in clinical trials – Draft. London. 15 December 2016. Online.↩︎
Only the FDA and China’s CDE recommend Reference-Scaled Average Bioequivalence (RSABE) and a comparison of within-subject standard deviations of products – requiring a fully replicated design. It is covered in another article. It is unclear what one can do if only a parallel design is feasible (drugs with a very long half life, studies in patients).↩︎
Fieller EC. Some Problems In Interval Estimation. J Royal Stat Soc B. 1954; 16(2): 175–85. JSTOR:2984043.↩︎
EMEA, CPMP. Points to Consider on Switching between Superiority and Non-Inferiority. London. 27 July 2000. Online.↩︎
Owen DB. A special case of a bivariate non-central t-distribution. Biometrika. 1965; 52(3/4): 437–46. doi:10.2307/2333696.↩︎
Labes D. Power parallel group: PowerTOST/SAS/PASS. 2015-08-14. BEBA Forum.↩︎
Zheng C, Wang J, Zhao L. Testing bioequivalence for multiple formulations with power and sample size calculations. Pharm Stat. 2012; 11(4): 334–41. doi:10.1002/pst.1522.↩︎
Julious SA. Tutorial in Biostatistics. Sample sizes for clinical trials with Normal data. Stat. Med. 2004; 23(12): 1921–86. doi:10.1002/sim.1783.↩︎
The R Foundation for Statistical Computing. A Guidance Document for the Use of R in Regulated Clinical Trial Environments. Vienna. October 18, 2021. Online.↩︎
The R Foundation for Statistical Computing. R: Software Development Life Cycle. A Description of R’s Development, Testing, Release and Maintenance Processes. Vienna. October 18, 2021. Online.↩︎
ICH. Statistical Principles for Clinical Trials. E9. 5 February 1998. Section 5.8. Online.↩︎
FDA. Statistical Software Clarifying Statement. May 6, 2015. Download.↩︎
WHO. Guidance for organizations performing in vivo bioequivalence studies. Geneva. May 2016. Technical Report Series No. 996, Annex 9. Section 4.4. Online.↩︎
ICH. Good Clinical Practice (GCP). E6(R3) – Draft. 19 May 2023. Section 4.5. Online.↩︎
Zimmerman DW. A note on preliminary tests of equality of variances. Br J Math Stat Psychol. 2004; 57(1): 173–81. doi:10.1348/000711004849222.↩︎
If groups sizes and variances are equal, it will reduce to the common t-test anyhow.↩︎