Consider allowing JavaScript. Otherwise, you have to be proficient in reading since formulas will not be rendered. Furthermore, the table of contents in the left column for navigation will not be available and codefolding not supported. Sorry for the inconvenience.
Examples in this article were generated with 4.1.0 by the package PowerTOST
.^{1}
More examples are given in the respective vignette.^{2} See also the README on GitHub for an overview, the online manual^{3} for details, and a collection of other articles.
Abbreviation  Meaning 

α  Nominal level of the test, probability of Type I Error, patient’ risk 
β  Probability of Type II Error, producer’s risk (1 – π) 
BE  Bioequivalence 
CV  Coefficient of Variation 
CV_{inter}  Betweensubject Coefficient of Variation 
CV_{intra}  Withinsubject Coefficient of Variation 
H_{0}  Null hypothesis 
H_{1}  Alternative hypothesis (also H_{a}) 
NTID  Narrow Therapeutic Index Drug 
π  Power 
TOST  Two OneSided Tests 
An ‘optimal’ study design is one which (taking all assumptions into account) has a reasonably high chance of demonstrating equivalence (power) whilst controlling the consumer risk.
A basic knowledge of R is required. To run the scripts at least version 1.4.3 (20161101) of PowerTOST
is suggested. Any version of R would likely do, though the current release of PowerTOST
was only tested with version 3.6.3 (20200229) and later.
Note that in all functions of PowerTOST
the arguments (say, the assumed T/Rratio theta0
, the BElimits (theta1
, theta2
), the assumed coefficient of variation CV
, etc.) have to be given as ratios and not in percent.
sampleN.TOST()
gives balanced sequences (i.e., the same number of subjects is allocated to each of the sequences). Furthermore, the estimated sample size is the total number of subjects (not subjects per sequence – like in some other software packages).
HigherOrder Designs with three and four treatments are supported.
design  internal name  df  common name 

"3x3"

3x3 crossover

2n4

3×3 Latin Square 
"3x6x3"

3x6x3 crossover

2n4

3×6×3 Williams’ design 
"4x4"

4x4 crossover

3n6

4×4 Latin Square or a Williams’ design 
"3x3"
denotes the Latin Square (ABCBCACAB).
"3x6x3"
denotes the 6sequence Williams’ design (ABCACBBACBCACABCBA).
"4x4"
denotes the Latin Square (ABCDBCDACDABDABC) or
any of the possible Williams’ designs with four periods (ADBCBACDCBDADCAB, ADCBBCDACABDDBAC, ACDBBDCACBADDABC, ACBDBADCCDABDBCA, ABDCBCADCDBADACB, ABCDBDACCADBDCBA).
df
are the degrees of freedom, where n
is the sample size.
“Suppose we have a bioequivalence study with three treatments – A, B, and C – and the objective of the study is to make pairwise comparisons among the treatments. Suppose further that treatment C is different in kind from A and B, so that the assumption of homogeneous variance among the three treatments is questionable. One way to do the analyses, under normality assumptions, is Two at a Time – e.g., to test hypotheses about A and B, use only the data from A and B. Another way is All at Once – include the data from all three treatments in a single analysis, making pairwise comparisons within this analysis. If the assumption of homogeneous variance is correct, the All at Once approach will provide more d.f. for estimating the common variance, resulting in increased power. If the variance of C differs from that of A and B, the All at Once approach may have reduced power or an inflated type I error rate, depending on the direction of the difference in variances.
Which one you should use is discussed further down in the subsections
Most examples deal with studies where the response variables likely follow a lognormal distribution, i.e., we assume a multiplicative model (ratios instead of differences). We work with \(\small{\log_{e}}\)transformed data in order to allow analysis by the ttest (requiring differences). This is the default in most functions of PowerTOST
and hence, the argument logscale = TRUE
does not need to be specified.
It may sound picky but ‘sample size calculation’ (as used in most guidelines and alas, in some publications and textbooks) is sloppy terminology. In order to get prospective power (and hence, a sample size), we need five values:
1 – 2 are fixed by the agency,
3 is set by the sponsor (commonly to 0.80 – 0.90), and
4 – 5 are just (uncertain!) assumptions.
In other words, obtaining a sample size is not an exact calculation like \(\small{2\times2=4}\) but always just an estimation.
“Power Calculation – A guess masquerading as mathematics.
Of note, it is extremely unlikely that all assumptions will be exactly realized in a particular study. Hence, calculating retrospective (a.k.a. post hoc, a posteriori) power is not only futile but plain nonsense.^{8}
Since generally the withinsubject variability CV_{intra} is lower than the betweensubject variability CV_{inter}, crossover studies are so popular. Of note, there is no relationship between CV_{intra} and CV_{inter}. An example are drugs which are subjected to polymorphic metabolism. For these drugs CV_{intra} ≪ CV_{inter}.
However, it is a prerequisite that no carryover from one period to the next exists. Only then the comparison of treatments will be unbiased.^{9}
Crossover studies can be not only performed in healthy volunteers but also in patients with a stable disease (e.g., asthma).
If crossovers are not feasible (e.g., for drugs with a very long halflife or in patients with an instable disease), studies in a parallel design should be performed (covered in another article).
The sample size cannot be directly estimated,
only power calculated for an already given sample size.
The power equations cannot be rearranged to solve for sample size.
“Power. That which statisticians are always calculating but never have.
If you are a nerd out in the wild and have only your scientific calculator with you, use the large sample approximation (i.e., based on the normal distribution of \(\small{\log_{e}}\)transformed data) taking power and \(\small{\alpha}\) into account.^{10}
Convert the coefficient of variation to the variance \[\begin{equation}\tag{1} \sigma^2=\log_{e}(CV^2+1) \end{equation}\] Numbers needed in the following: \(\small{Z_{10.2}=0.8416212}\), \(\small{Z_{10.1}=1.281552}\), \(\small{Z_{10.05}=1.644854}\). \[\begin{equation}\tag{2a} \textrm{if}\;\theta_0<1:n\sim\frac{2\,\sigma^2(Z_{1\beta}Z_{1\alpha})^2}{\left(\log_{e}(\theta_0)\log_{e}(\theta_1)\right)^2} \end{equation}\] \[\begin{equation}\tag{2b} \textrm{if}\;\theta_0>1:n\sim\frac{2\,\sigma^2(Z_{1\beta}Z_{1\alpha})^2}{\left(\log_{e}(\theta_0)\log_{e}(\theta_2)\right)^2} \end{equation}\] \[\begin{equation}\tag{2c} \textrm{if}\;\theta_0=1:n\sim\frac{2\,\sigma^2(Z_{1\beta/2}Z_{1\alpha})^2}{\left(\log_{e}(\theta_2)\right)^2} \end{equation}\] Generally you should use \(\small{(2\textrm{a},2\textrm{b})}\). Only if you are courageous, use \(\small{(2\textrm{c})}\). However, I would not recommend that.^{11}
The sample size by \(\small{(2)}\) can be a real number. We have to round it up to the next even integer to obtain balanced sequences (equal number of subjects in sequences TR
and RT
).
In \(\small{(12)}\) we assume that the population variance \(\small{\sigma^2}\) is known, which is never the case. In practice is is uncertain because only its sample estimate \(\small{s^2}\) is known. Therefore, \(\small{(12)}\) will always give a sample size which is too low:
For \(\small{CV=0.25}\), \(\small{\theta_0=0.95}\), \(\small{\alpha=0.05}\), \(\small{\beta=0.20,}\) and the conventional limits we get 25.383 for the total sample size (rounded up to 27). As we will see later, the correct sample size is also 27.
“Power: That which is wielded by the priesthood of clinical trials, the statisticians, and a stick which they use to beta their colleagues.
Since we evaluate studies based on the ttest, we have to take the degrees of freedom (in the 3treatment designs \(\small{df=2n4}\) and in the 4treatment designs \(\small{df=3n6}\)) into account.
Power depends on the degrees of freedom, which themselves depend on the sample size.
Hence, the power equations are used to estimate the sample size in an iterative way.
Example based on the noncentral tdistribution (warning: 114 lines of code).
get.sequences < function(design) {
known < known.designs()[, c(2, 5)]
ns < as.integer(known[known$design == design,
"steps"])
return(ns)
}
get.df < function(design, n) {
known < known.designs()[, 2:3]
dfe < known[known$design == design, "df"]
df < as.integer(eval(parse(text = dfe)))
return(df)
}
make.equal < function(n, ns) { # equally sized sequences
return(as.integer(ns * (n %/% ns + as.logical(n %% ns))))
}
CV < 0.25 # withinsubject CV
theta0 < 0.95 # T/Rratio
theta1 < 0.80 # lower BElimit
theta2 < 1.25 # upper BElimit
target < 0.80 # desired (target) power
alpha < 0.05 # level of the test
beta < 1  target # producer's risk
design < "3x3"
ns < get.sequences(design)
df < data.frame(method = character(),
iteration = integer(),
n = integer(),
power = numeric())
s2 < log(CV^2 + 1)
s < sqrt(s2)
Z.alpha < qnorm(1  alpha)
Z.beta < qnorm(1  beta)
Z.beta.2 < qnorm(1  beta / 2)
# large sample approximation
if (theta0 == 1) {
num < 2 * s2 * (Z.beta.2 + Z.alpha)^2
n.seq < ceiling(num / log(theta2)^2)
} else {
num < 2 * s2 * (Z.beta + Z.alpha)^2
if (theta0 < 1) {
denom < (log(theta0)  log(theta1))^2
} else {
denom < (log(theta0)  log(theta2))^2
}
n.seq < ceiling(num / denom)
}
n < make.equal(n.seq, ns)
power < pnorm(sqrt((log(theta0)  log(theta2))^2 * n /
(2 * s2))  Z.alpha) +
pnorm(sqrt((log(theta0)  log(theta1))^2 * n /
(2 * s2))  Z.alpha)  1
df[1, 1:4] < c("large sample", "\u2013",
n, sprintf("%.6f", power))
# check by central t approximation
# (since the variance is unknown)
t.alpha < qt(1  alpha, get.df(design, n))
if (theta0 == 1) {
num < 2 * s2 * (Z.beta.2 + t.alpha)^2
n.seq < ceiling(num / log(theta2)^2)
} else {
num < 2 * s2 * (Z.beta + t.alpha)^2
if (theta0 < 1) {
denom < (log(theta0)  log(theta1))^2
} else {
denom < (log(theta0)  log(theta2))^2
}
n.seq < ceiling(num / denom)
}
n < make.equal(n.seq, ns)
t.alpha < qt(1  alpha, get.df(design, n))
power < pnorm(sqrt((log(theta0)  log(theta2))^2 * n /
(2 * s2))  t.alpha) +
pnorm(sqrt((log(theta0)  log(theta1))^2 * n /
(2 * s2))  t.alpha)  1
df[2, 1:4] < c("central t", "\u2013",
n, sprintf("%.6f", power))
i < 1
if (power < target) { # iterate upwards
repeat {
sem < sqrt(2 / n) * s
ncp < c((log(theta0)  log(theta1)) / sem,
(log(theta0)  log(theta2)) / sem)
power < diff(pt(c(+1, 1) * qt(1  alpha,
df = get.df(design, n)),
df = get.df(design, n), ncp = ncp))
df[i + 2, 1:4] < c("noncentral t", i,
n, sprintf("%.6f", power))
if (power >= target) {
break
} else {
i < i + 1
n < n + 2
}
}
} else { # iterate downwards
repeat {
sem < sqrt(2 / n) * s
ncp < c((log(theta0)  log(theta1)) / sem,
(log(theta0)  log(theta2)) / sem)
power < diff(pt(c(+1, 1) * qt(1  alpha,
df = get.df(design, n)),
df = get.df(design, n), ncp = ncp))
df[i + 2, 1:4] < c("noncentral t", i,
n, sprintf("%.6f", power))
if (power < target) {
df < df[nrow(df), ]
break
} else {
i < i + 1
n < n  2
}
}
}
print(df, row.names = FALSE)
R> method iteration n power
R> large sample – 27 0.813971
R> central t – 27 0.805100
R> noncentral t 1 27 0.803494
Now it’s high time to leave Base R behind and continue with PowerTOST
.
Makes our life much easier.
Its sample size functions use a modification of Zhang’s method^{12} for the first guess. You can unveil the course of iterations with details = TRUE
.
R>
R> +++++++++++ Equivalence test  TOST +++++++++++
R> Sample size estimation
R> 
R> Study design: 3x3 crossover
R> Design characteristics:
R> df = 2*n4, design const. = 2, step = 3
R>
R> logtransformed data (multiplicative model)
R>
R> alpha = 0.05, target power = 0.8
R> BE margins = 0.8 ... 1.25
R> True ratio = 0.95, CV = 0.25
R>
R> Sample size search (ntotal)
R> n power
R> 24 0.753428
R> 27 0.803494
R>
R> Exact power calculation with
R> Owen's Q functions.
Two iterations. In many cases it hits the bull’s eye right away, i.e., already with the first guess.
Never (ever!) estimate the sample size based on the large sample approximation \(\small{(2)}\).
We evaluate our studies based on the ttest, right?
The sample size based on the normal distribution may be too small, thus compromising power.
R> method n Delta RE power
R> large sample approx. (2) 27 ±0 ±0.0% 0.803494
R> noncentral tdistribution 27 ±0 ±0.0% 0.803494
R> exact method (Owen’s Q) 27 – – 0.803494
In this example it works but it may fail in others.
Throughout the examples by I’m referring to studies in a single center – not multiple groups within them or multicenter studies. That’s another story.
Most methods of PowerTOST
are based on pairwise comparisons. It is up to you to adjust the level of the test alpha
if you want to compare more (say, two test treatments v.s. a reference or each of them against one of the others) in order to avoid inflation of the familywise error rate due to multiplicity. Details are given further down.
First of all we have have to consider the purpose of the study and how we will evaluate it.
Let’s discuss the cases from above and whether we can evaluate the study by the common \(\small{\alpha}\) 0.05 (i.e., with a 90% confidence interval) or have to adjust for multiplicity.
No multiplicity correction has to be used.
Say, we have two candidate treatments (\(\small{\textrm{C}_1}\), \(\small{\textrm{C}_2}\)) and one reference (\(\small{\textrm{R}}\)).
We would select the candidate with \(\small{\min \left \{\left\log_{e}\theta_{\textrm{C}_1}\right,\left\log_{e}\theta_{\textrm{C}_2}\right\right\}}\) for the pivotal study because it is the most promising one. If \(\small{\min \left \{\left\log_{e}\theta_{\textrm{C}_1}\right\sim\left\log_{e}\theta_{\textrm{C}_2}\right\right\}}\), we would select the candidate with lower withinsubject variance.
In the ‘Two at a Time’ approach we obtain not only two point estimates (like in the ‘All at Once’ approach) but also two withinsubject variances (in the comparisons \(\small{\textrm{T}}\) vs \(\small{\textrm{R}_1}\) and \(\small{\textrm{T}}\) vs \(\small{\textrm{R}_2}\)).^{13} In both approaches no multiplicity correction has to be used.^{14}
Sometimes the pilot was indecisive (very similar T/R ratios and CVs) and companies are wary to select one based on ‘gut feelings’ and include both in the pivotal study.
Submit the ‘better’ one to the authority and stop developing the other.
Opinion split amongst statisticians.
In this approach – which is preferred by some agencies, e.g., the EMA^{16} ^{17} ^{18} – one treatment is excluded and the analysis on the remaining two performed. That means you obtain two separate Incomplete Block Designs (IBD), where the coding (treatments, sequences, periods) is kept.
To plan for this approach specify design = "2x2x2"
.
Apart from the regulatory recommendation I suggest this approach, since the ‘All at Once’ approach may lead to biased estimates and an inflated Type I Error.^{19}
In this approach we assume homogenicity in the ANOVA of pooled data (equal variances in the pairwise comparisons) and get one residual variance.
To plan for this approach specify one of the design arguments given above, i.e., "3x3"
, "3x6x3"
, or "4x4"
.
Whilst popular (and hence, used in most of the examples), I don’t recommend it.
We assume a CV of 0.25, a T/Rratio of 0.95, and target a power of 0.80.
Since theta0 = 0.95
and targetpower = 0.80
are defaults of the function, we don’t have to give them explicitely. As usual in bioequivalence, alpha = 0.05
is employed (we will assess the study by a \(\small{100(12\,\alpha)=90\%}\) confidence interval) and the BElimits are theta1 = 0.80
and theta2 = 1.25
. Since they are also defaults of the function, we don’t have to give them as well. Hence, you need to specify only the CV
and the design
.
R>
R> +++++++++++ Equivalence test  TOST +++++++++++
R> Sample size estimation
R> 
R> Study design: 3x3 crossover
R> logtransformed data (multiplicative model)
R>
R> alpha = 0.05, target power = 0.8
R> BE margins = 0.8 ... 1.25
R> True ratio = 0.95, CV = 0.25
R>
R> Sample size (total)
R> n power
R> 27 0.803494
R>
R> +++++++++++ Equivalence test  TOST +++++++++++
R> Sample size estimation
R> 
R> Study design: 2x2 crossover
R> logtransformed data (multiplicative model)
R>
R> alpha = 0.05, target power = 0.8
R> BE margins = 0.8 ... 1.25
R> True ratio = 0.95, CV = 0.25
R>
R> Sample size (total)
R> n power
R> 28 0.807439
Don’t get confused by Study design: 2x2 crossover
in the output. It only means that we will evaluate the study by the ‘Two at a Time’ approach as an IBD. In the ‘All at Once’ approach we have more degrees of freedom (50) than in the ‘Two at a Time’ approach (26). An example to get the correct sample size is given in the Interlude further down.
Sometimes we are not interested in the entire output and want to use only a part of the results in subsequent calculations. We can suppress the output by the argument print = FALSE
and assign the result to a data.frame (here df
).
Although you could access the elements by the number of the column(s), I don’t recommend that, since in various functions these numbers are different and hence, difficult to remember.
Let’s retrieve the column names of df
:
names(df)
R> [1] "Design" "alpha" "CV"
R> [4] "theta0" "theta1" "theta2"
R> [7] "Sample size" "Achieved power" "Target power"
Now we can access the elements of df
by their names. Note that double square brackets [[…]]
have to be used (single ones are used to access elements by their numbers).
If you insist in accessing elements by columnnumbers, use single square brackets […]
.
With 27 subjects (9 per sequence) we achieve the power we desire.
What happens if we have one dropout?
R> Unbalanced design. n(i)=9/9/8 assumed.
R> [1] 0.7868869
Slightly below the 0.80 we desire.
Interlude 1
As we have seen above, when we plan to evaluate the study by the ‘Two at a Time’ approach, we will get an even number of subjects from sampleN.TOST(design = "2x2")
. However, we want balanced sequences in our design (which will be an odd number in the "3x3"
design, a multiple of six in the "3x6x3"
design, and a multiple of four in the "4x4"
design). Hence, sometimes we have to round the sample size up.
get.sequences < function(design) {
known < known.designs()[, c(2, 5)]
ns < as.integer(known[known$design == design,
"steps"])
return(ns)
}
make.equal < function(n, ns) {
return(as.integer(ns * (n %/% ns + as.logical(n %% ns))))
}
design < "3x3"
CV < 0.25
ns < get.sequences(design)
n.TaT < sampleN.TOST(CV = CV, design = "2x2",
print = FALSE)[["Sample size"]]
n.AaO < sampleN.TOST(CV = CV, design = design,
print = FALSE)[["Sample size"]]
n < make.equal(n = n.TaT, ns = ns)
df < data.frame(design = design,
evaluation = c("\u2018All at Once\u2019",
"\u2018Two at a Time\u2019"),
n = c(n.AaO, n), power = NA)
df[1, 4] < power.TOST(CV = CV, design = design,
n = n.AaO)
df[2, 4] < power.TOST(CV = CV, design = "2x2",
n = n)
print(df, row.names = FALSE)
R> design evaluation n power
R> 3x3 ‘All at Once’ 27 0.8034938
R> 3x3 ‘Two at a Time’ 30 0.8342518
This uprounding from the sample size of the ‘All at Once’ approach protects us also against loss of power due to dropouts.
Since dropouts are common, it makes sense to include / dose more subjects in order to end up with a number of eligible subjects which is not lower than our initial estimate.
Let us explore that in the next section.
We define two supportive functions:
n
will be rounded up to the next multiple of the number of sequences ns
.n
, the anticipated droputrate do.rate
, and the number of sequences ns
.In order to come up with a suggestion we have to anticipate a (realistic!) dropout rate. Note that this not the job of the statistician; ask the Principal Investigator.
“It is a capital mistake to theorise before one has data.
The dropoutrate is calculated from the eligible and dosed subjects
or simply \[\begin{equation}\tag{3}
do.rate=1n_\textrm{eligible}/n_\textrm{dosed}
\end{equation}\] Of course, we know it only after the study was performed.
By substituting \(n_\textrm{eligible}\) with the estimated sample size \(n\), providing an anticipated dropoutrate and rearrangement to find the adjusted number of dosed subjects \(n_\textrm{adj}\) we should use \[\begin{equation}\tag{4} n_\textrm{adj}=\;\upharpoonleft n\,/\,(1do.rate) \end{equation}\] where \(\upharpoonleft\) denotes rounding up to the next even number as implemented in the functions above.
An all too common mistake is to increase the estimated sample size \(n\) by the dropoutrate according to \[\begin{equation}\tag{5} n_\textrm{adj}=\;\upharpoonleft n\times(1+do.rate) \end{equation}\]
If you used \(\small{(5)}\) in the past – you are not alone. In a small survey a whooping 29% of respondents reported to use it.^{20} Consider changing your routine.
“There are no routine statistical questions, only questionable statistical routines.
In the following I specified more arguments to make the function more flexible.
Note that I wrapped the function power.TOST()
in suppressMessages()
. Otherwise, the function will throw for any odd sample size a message telling us that the design is unbalanced. Well, we know that.
CV < 0.25 # withinsubject CV
target < 0.80 # target (desired) power
theta0 < 0.95 # assumed T/Rratio
theta1 < 0.80 # lower BE limit
theta2 < 1.25 # upper BE limit
design < "3x3"
ns < 3
do.rate < 0.15 # anticipated dropoutrate 10%
df < sampleN.TOST(CV = CV, theta0 = theta0,
theta1 = theta1,
theta2 = theta2,
targetpower = target,
design = design,
print = FALSE)
# calculate the adjusted sample size
n.adj < nadj(df[["Sample size"]], do.rate, ns)
# (decreasing) vector of eligible subjects
n.elig < n.adj:df[["Sample size"]]
info < paste0("Assumed CV : ",
CV,
"\nAssumed T/R ratio : ",
theta0,
"\nBE limits : ",
paste(theta1, theta2, sep = "\u2026"),
"\nTarget (desired) power : ",
target,
"\nAnticipated dropoutrate: ",
do.rate,
"\nEstimated sample size : ",
df[["Sample size"]], " (",
df[["Sample size"]]/ns, "/sequence)",
"\nAchieved power : ",
signif(df[["Achieved power"]], 4),
"\nAdjusted sample size : ",
n.adj, " (", n.adj/ns, "/sequence)",
"\n\n")
# explore the potential outcome for
# an increasing number of dropouts
do.act < signif((n.adj  n.elig) / n.adj, 4)
df < data.frame(dosed = n.adj,
eligible = n.elig,
dropouts = n.adj  n.elig,
do.act = do.act,
power = NA)
for (i in 1:nrow(df)) {
df$power[i] < suppressMessages(
power.TOST(CV = CV,
theta0 = theta0,
theta1 = theta1,
theta2 = theta2,
design = design,
n = df$eligible[i]))
}
cat(info); print(round(df, 4), row.names = FALSE)
R> Assumed CV : 0.25
R> Assumed T/R ratio : 0.95
R> BE limits : 0.8…1.25
R> Target (desired) power : 0.8
R> Anticipated dropoutrate: 0.15
R> Estimated sample size : 27 (9/sequence)
R> Achieved power : 0.8035
R> Adjusted sample size : 33 (11/sequence)
R> dosed eligible dropouts do.act power
R> 33 33 0 0.0000 0.8745
R> 33 32 1 0.0303 0.8641
R> 33 31 2 0.0606 0.8536
R> 33 30 3 0.0909 0.8430
R> 33 29 4 0.1212 0.8300
R> 33 28 5 0.1515 0.8168
R> 33 27 6 0.1818 0.8035
In the worst case (6 dropouts) we end up with the originally estimated sample size of 27. Power preserved, mission accomplished. If we have less dropouts, splendid – we gain power.
If we would have adjusted the sample size acc. to \(\small{(5)}\) we would have dosed also 33 subjects like the 33 acc. to \(\small{(4)}\).
Cave: This might not always be the case… If the anticipated dropout rate of 15% is realized in the study, we would have also 28 eligible subjects (power 0.8168). In this example we achieve still more than our target power but the loss might be relevant in other cases.
As said in the preliminaries, calculating post hoc power is futile.
“There is simple intuition behind results like these: If my car made it to the top of the hill, then it is powerful enough to climb that hill; if it didn’t, then it obviously isn’t powerful enough. Retrospective power is an obvious answer to a rather uninteresting question. A more meaningful question is to ask whether the car is powerful enough to climb a particular hill never climbed before; or whether a different car can climb that new hill. Such questions are prospective, not retrospective.
However, sometimes we are interested in it for planning the next study.
If you give and odd total sample size n
, power.TOST()
will try to keep sequences as balanced as possible and show in a message how that was done.
R> Unbalanced design. n(i)=9/9/8 assumed.
R> [1] 0.786887
Say, our study was more unbalanced. Let us assume that we dosed 33 subjects, the total number of subjects was also 26 but all dropouts occured in one sequence (unlikely but possible).
Instead of the total sample size n
we can give the number of subjects of each sequence as a vector (the order is not relevant, i.e., it does not matter which element refers to which sequence).
n.adj < head(df$dosed, 1)
n.act < tail(df$eligible, 1)
do < tail(df$dropouts, 1)
n.eli < rep(n.adj / ns, ns)
n.eli[1] < n.adj / ns  do
post.hoc < signif(power.TOST(CV = 0.25,
n = n.eli,
design = design), 6)
sig.dig < nchar(as.character(n.adj))
fmt < paste0("%", sig.dig, ".0f (%",
sig.dig, ".0f dropouts)")
info < paste0("Dosed subjects: ",
sprintf("%2.0f", n.adj),
"\nEligible : ",
sprintf(fmt, n.act,
n.adj  n.act))
for (i in seq_along(n.eli)) {
info < paste(info, "\n Sequence ", i, ":",
sprintf(fmt, n.eli[i],
n.adj / ns  n.eli[i]))
}
info < paste(info, "\nPower : ",
post.hoc, "\n")
cat(info)
R> Dosed subjects: 33
R> Eligible : 27 ( 6 dropouts)
R> Sequence 1 : 5 ( 6 dropouts)
R> Sequence 2 : 11 ( 0 dropouts)
R> Sequence 3 : 11 ( 0 dropouts)
R> Power : 0.747186
Of course, in a particular study you will provide the numbers in the n
vector directly.
As stated already in the introduction, the CV and the T/Rratio are only assumptions. Whatever their origin might be (literature, previous studies) they carry some degree of uncertainty. Hence, believing^{21} that they are the true ones may be risky.
Some statisticians call that the ‘CarvedinStone’ approach.
Say, we performed a pilot study in 12 subjects and estimated the CV as 0.25.
The \(\alpha\) confidence interval of the CV is obtained via the \(\chi^2\)distribution of its error variance \(\sigma^2\) with \(\small{n2}\) degrees of freedom. \[\begin{matrix}\tag{6} s^2=\log_{e}(CV^2+1)\\ L=\frac{(n1)\,s^2}{\chi_{\alpha/2,\,n2}^{2}}\leq\sigma^2\leq\frac{(n1)\,s^2}{\chi_{1\alpha/2,\,n2}^{2}}=U\\ \left\{lower\;CL,\;upper\;CL\right\}=\left\{\sqrt{\exp(L)1},\sqrt{\exp(U)1}\right\} \end{matrix}\] Let’s calculate the 95% confidence interval of the CV to get an idea.
m < 12 # pilot study
ci < CVCL(CV = 0.25, df = get.df(design, m),
side = "2sided", alpha = 0.05)
signif(ci, 4)
R> lower CL upper CL
R> 0.1901 0.3671
Surprised? Although 0.25 is the best estimate for planning the next study, there is no guarantee that we will get exactly the same outcome. Since the \(\chi^2\)distribution is skewed to the right, it is more likely to get a higher CV than a lower one in the planned study.
If we plan the study based on 0.25, we would opt for 27 subjects like in the examples before (not adjusted for the dropoutrate).
If the CV will be lower, we gain power. Fine.
But what if it will be higher? Of course, we will loose power. But how much?
Let’s explore what might happen at the confidence limits of the CV.
m < 12
ci < CVCL(CV = 0.25, get.df(design, m),
side = "2sided", alpha = 0.05)
n < 27
comp < data.frame(CV = c(ci[["lower CL"]], 0.25,
ci[["upper CL"]]),
power = NA)
for (i in 1:nrow(comp)) {
comp$power[i] < power.TOST(CV = comp$CV[i],
design = design,
n = n)
}
comp[, 1] < signif(comp[, 1], 4)
comp[, 2] < signif(comp[, 2], 6)
print(comp, row.names = FALSE)
R> CV power
R> 0.1901 0.951526
R> 0.2500 0.803494
R> 0.3671 0.417982
Ouch!
What can we do? The larger the previous study was, the larger the degrees of freedom and hence, the narrower the confidence interval of the CV. In simple terms: The estimate is more certain. On the other hand, it also means that very small pilot studies are practically useless.
m < seq(6, 16, 2)
df < data.frame(n = m, CV = 0.25,
l = NA, u = NA)
for (i in 1:nrow(df)) {
df[i, 3:4] < CVCL(CV = 0.25,
df = get.df(design, m[i]),
side = "2sided",
alpha = 0.05)
}
df[, 3:4] < signif(df[, 3:4], 4)
names(df)[3:4] < c("lower CL", "upper CL")
print(df, row.names = FALSE)
R> n CV lower CL upper CL
R> 6 0.25 0.1675 0.4992
R> 8 0.25 0.1779 0.4238
R> 10 0.25 0.1849 0.3883
R> 12 0.25 0.1901 0.3671
R> 14 0.25 0.1940 0.3528
R> 16 0.25 0.1973 0.3425
A Bayesian method is implemented in PowerTOST
, which takes the uncertainty of estimates into account. We can explore the uncertainty of the CV, the T/Rratio, and both.
m < 12 # sample size of pilot
CV < 0.25
theta0 < 0.95
df < expsampleN.TOST(CV = CV, theta0 = theta0,
targetpower = 0.80,
design = design,
prior.parm = list(
m = m, design = design),
prior.type = "CV",
details = FALSE)
R>
R> ++++++++++++ Equivalence test  TOST ++++++++++++
R> Sample size est. with uncertain CV
R> 
R> Study design: 3x3 crossover
R> logtransformed data (multiplicative model)
R>
R> alpha = 0.05, target power = 0.8
R> BE margins = 0.8 ... 1.25
R> Ratio = 0.95
R> CV = 0.25 with 20 df
R>
R> Sample size (ntotal)
R> n exp. power
R> 30 0.804060
Not that bad. 11% more subjects required.
What about an uncertain T/Rratio?
m < 12
CV < 0.25
theta0 < 0.95
df < expsampleN.TOST(CV = CV, theta0 = theta0,
targetpower = 0.80,
design = design,
prior.parm = list(
m = m, design = design),
prior.type = "theta0",
details = FALSE)
R>
R> ++++++++++++ Equivalence test  TOST ++++++++++++
R> Sample size est. with uncertain theta0
R> 
R> Study design: 3x3 crossover
R> logtransformed data (multiplicative model)
R>
R> alpha = 0.05, target power = 0.8
R> BE margins = 0.8 ... 1.25
R> Ratio = 0.95
R> CV = 0.25
R>
R> Sample size (ntotal)
R> n exp. power
R> 183 0.881163
Terrible! What a nightmare.
That should not take us by surprise. We don’t know where the true T/Rratio lies but we can calculate the lower 95% confidence limit of the pilot study’s point estimate to get an idea about a worst case.
m < 12
CV < 0.25
pe < 0.95
ci < round(CI.BE(CV = CV, pe = 0.95,
n = m, design = design), 4)
if (pe <= 1) {
cl < ci[["lower"]]
} else {
cl < ci[["upper"]]
}
print(cl)
R> [1] 0.7988
Exlore the impact of a relatively 5% higher CV and a relatively 5% lower T/Rratio on power for a given sample size.
n < 27
CV < 0.25
theta0 < 0.95
comp1 < data.frame(CV = c(CV, CV*1.05),
power = NA)
comp2 < data.frame(theta0 = c(theta0, theta0*0.95),
power = NA)
for (i in 1:2) {
comp1$power[i] < power.TOST(CV = comp1$CV[i],
theta0 = theta0,
design = design,
n = n)
}
comp1$power < signif(comp1$power, 6)
for (i in 1:2) {
comp2$power[i] < power.TOST(CV = CV,
theta0 = comp2$theta0[i],
design = design,
n = n)
}
comp2$power < signif(comp2$power, 6)
print(comp1, row.names = F); print(comp2, row.names = F)
R> CV power
R> 0.2500 0.803494
R> 0.2625 0.765085
R> theta0 power
R> 0.9500 0.803494
R> 0.9025 0.550795
Interlude 2
Note the logscale of the xaxis. It demonstrates that power curves are symmetrical around 1 (\(\small{\log_{e}(1)=0}\), where \(\small{\log_{e}(\theta_2)=\left\log_{e}(\theta_1)\right}\)) and we will achieve the same power for \(\small{\theta_0}\) and \(\small{1/\theta_0}\) (e.g., for 0.95 and 1.0526). Furthermore, all curves intersect 0.05 at \(\small{\theta_1}\) and \(\small{\theta_2}\), which means that for a true \(\small{\theta_0}\) at one of the limits \(\small{\alpha}\) is strictly controlled.
<nitpick>CV < 0.25
delta < 0.05 # direction unknown
theta0s < c(1  delta, 1 / (1 + delta),
1 + delta, 1 / (1  delta))
n < sampleN.TOST(CV = CV, theta0 = 1  delta,
design = design,
print = FALSE)[["Sample size"]]
comp1 < data.frame(CV = CV, theta0 = theta0s,
base = c(TRUE, rep(FALSE, 3)),
n = n, power = NA)
for (i in 1:nrow(comp1)) {
comp1$power[i] < power.TOST(CV = CV,
theta0 = comp1$theta0[i],
n = n,
design = design)
}
n < sampleN.TOST(CV = CV, theta0 = 1 + delta,
design = design,
print = FALSE)[["Sample size"]]
comp2 < data.frame(CV = CV, theta0 = theta0s,
base = c(FALSE, FALSE, TRUE, FALSE),
n = n, power = NA)
for (i in 1:nrow(comp2)) {
comp2$power[i] < power.TOST(CV = CV,
design = design,
theta0 = comp2$theta0[i],
n = n)
}
comp1[, c(2, 5)] < signif(comp1[, c(2, 5)] , 4)
comp2[, c(2, 5)] < signif(comp2[, c(2, 5)] , 4)
print(comp1, row.names = F); print(comp2, row.names = F)
R> CV theta0 base n power
R> 0.25 0.9500 TRUE 27 0.8035
R> 0.25 0.9524 FALSE 27 0.8124
R> 0.25 1.0500 FALSE 27 0.8124
R> 0.25 1.0530 FALSE 27 0.8035
R> CV theta0 base n power
R> 0.25 0.9500 FALSE 27 0.8035
R> 0.25 0.9524 FALSE 27 0.8124
R> 0.25 1.0500 TRUE 27 0.8124
R> 0.25 1.0530 FALSE 27 0.8035
</nitpick>
Since power is much more sensitive to the T/Rratio than to the CV, the results obtained with the Bayesian method should be clear now.
Essentially this leads to the murky waters of prospective sensitivity analyses, which is covered in another article.
An appetizer to show the maximum deviations (CV, T/Rratio and decreased sample size due to dropouts) which give still a minimum acceptable power of ≥ 0.70:
CV < 0.25
theta0 < 0.95
target < 0.80
minpower < 0.70
pa < pa.ABE(CV = CV, theta0 = theta0,
targetpower = target,
minpower = minpower,
design = design)
change.CV < 100*(tail(pa$paCV[["CV"]], 1) 
pa$plan$CV) / pa$plan$CV
change.theta0 < 100*(head(pa$paGMR$theta0, 1) 
pa$plan$theta0) /
pa$plan$theta0
change.n < 100*(tail(pa$paN[["N"]], 1) 
pa$plan[["Sample size"]]) /
pa$plan[["Sample size"]]
comp < data.frame(parameter = c("CV", "theta0", "n"),
change = c(change.CV,
change.theta0,
change.n))
comp$change < sprintf("%+.2f%%", comp$change)
names(comp)[2] < "relative change"
print(pa, plotit = FALSE); print(comp, row.names = FALSE)
R> Sample size plan ABE
R> Design alpha CV theta0 theta1 theta2 Sample size
R> 3x3 0.05 0.25 0.95 0.8 1.25 27
R> Achieved power
R> 0.8034938
R>
R> Power analysis
R> CV, theta0 and number of subjects leading to min. acceptable power of =0.7:
R> CV= 0.2827, theta0= 0.9276
R> n = 22 (power= 0.7106)
R> parameter relative change
R> CV +13.09%
R> theta0 2.36%
R> n 18.52%
Confirms what we have seen above. Interesting that the sample size is the least sensitive one. Many overrate the impact of dropouts on power.
If you didn’t stop reading in desperation (understandable), explore both uncertain CV and T/Rratio:
m < 12
CV < 0.25
theta0 < 0.95
df < expsampleN.TOST(CV = CV, theta0 = theta0,
targetpower = 0.80,
design = design,
prior.parm = list(
m = m, design = design),
prior.type = "both",
details = FALSE)
R>
R> ++++++++++++ Equivalence test  TOST ++++++++++++
R> Sample size est. with uncertain CV and theta0
R> 
R> Study design: 3x3 crossover
R> logtransformed data (multiplicative model)
R>
R> alpha = 0.05, target power = 0.8
R> BE margins = 0.8 ... 1.25
R> Ratio = 0.95 with 20 df
R> CV = 0.25 with 20 df
R>
R> Sample size (ntotal)
R> n exp. power
R> 198 0.866206
I don’t suggest that you should use it in practice. AFAIK, not even the main author of this function (Benjamin Lang) does. However, it serves educational purposes to show that it is not that easy and why even properly planned studies might fail.
An alternative to assuming a T/Rratio is based on statistical assurance.^{22} This concept uses the distribution of T/Rratios and assumes an uncertainty parameter \(\small{\sigma_\textrm{u}}\). A natural assumption is \(\small{\sigma_\textrm{u}=1\theta_0}\), i.e., for the commonly applied T/Rratio of 0.95 one can use the argument sem = 0.05
in the function expsampleN.TOST()
, where the argument theta0
has to be fixed at 1.
CV < 0.25
theta0 < 0.95
target < 0.80
sigma.u < 1  theta0
comp < data.frame(theta0 = theta0,
n.1 = NA, power = NA,
sigma.u = sigma.u,
n.2 = NA, assurance = NA)
comp[2:3] < sampleN.TOST(CV = CV,
targetpower = target,
theta0 = theta0,
design = design,
details = FALSE,
print = FALSE)[7:8]
comp[5:6] < expsampleN.TOST(CV = CV,
theta0 = 1, # fixed!
targetpower = target,
design = design,
prior.type = "theta0",
prior.parm =
list(sem = sigma.u),
details = FALSE,
print = FALSE)[9:10]
names(comp)[c(2, 5)] < rep("n", 2)
print(signif(comp, 6), row.names = FALSE)
R> theta0 n power sigma.u n assurance
R> 0.95 27 0.803494 0.05 27 0.813331
It is not unusal that equivalence of more than one endpoint has to be demonstrated. In bioequivalence the pharmacokinetic metrics C_{max} and AUC_{0–t} are mandatory (in some jurisdictions like the FDA additionally AUC_{0–∞}).
We don’t have to worry about multiplicity issues (inflated Type I Error) since if all tests must pass at level \(\alpha\), we are protected by the intersectionunion principle.^{23} ^{24}
We design the study always for the worst case combination, i.e., based on the PK metric requiring the largest sample size. In some jurisdictions wider BE limits for C_{max} are acceptable. Let’s explore that with different CVs and T/Rratios.
metrics < c("Cmax", "AUCt", "AUCinf")
CV < c(0.25, 0.19, 0.20)
theta0 < c(0.95, 1.04, 1.06)
theta1 < c(0.75, 0.80, 0.80)
theta2 < 1 / theta1
target < 0.80
df < data.frame(metric = metrics,
theta1 = theta1, theta2 = theta2,
CV = CV, theta0 = theta0, n = NA)
for (i in 1:nrow(df)) {
df$n[i] < sampleN.TOST(CV = CV[i],
theta0 = theta0[i],
theta1 = theta1[i],
theta2 = theta2[i],
targetpower = target,
design = design,
print = FALSE)[["Sample size"]]
}
df$theta1 < sprintf("%.4f", df$theta1)
df$theta2 < sprintf("%.4f", df$theta2)
txt < paste0("Sample size based on ",
df$metric[df$n == max(df$n)], ".\n")
print(df, row.names = FALSE); cat(txt)
R> metric theta1 theta2 CV theta0 n
R> Cmax 0.7500 1.3333 0.25 0.95 18
R> AUCt 0.8000 1.2500 0.19 1.04 15
R> AUCinf 0.8000 1.2500 0.20 1.06 21
R> Sample size based on AUCinf.
Even if we assume the same T/Rratio for two PK metrics, we will get a wider margin for the one with lower variability.
Let’s continue with the conditions of our previous examples, this time assuming that the CV and T/Rratio were applicable for C_{max}. As common in PK, the CV of AUC is lower, say only 0.20. That means, the study is ‘overpowered’ for the assumed T/Rratio of AUC.
Which are the extreme T/Rratios (largest deviations of T from R) giving still the target power?
opt < function(x) {
power.TOST(theta0 = x, CV = df$CV[2],
theta1 = theta1,
theta2 = theta2,
n = df$n[1],
design = design)  target
}
metrics < c("Cmax", "AUC")
CV < c(0.25, 0.20) # Cmax, AUC
theta0 < 0.95 # both metrics
theta1 < 0.80
theta2 < 1.25
target < 0.80
df < data.frame(metric = metrics, theta0 = theta0,
CV = CV, n = NA, power = NA)
for (i in 1:nrow(df)) {
df[i, 4:5] < sampleN.TOST(CV = CV[i],
theta0 = theta0,
theta1 = theta1,
theta2 = theta2,
targetpower = target,
design = design,
print = FALSE)[7:8]
}
df$power < signif(df$power, 6)
if (theta0 < 1) {
res < uniroot(opt, tol = 1e8,
interval = c(theta1 + 1e4, theta0))
} else {
res < uniroot(opt, tol = 1e8,
interval = c(theta0, theta2  1e4))
}
res < unlist(res)
theta0s < c(res[["root"]], 1/res[["root"]])
txt < paste0("Target power for ", metrics[2],
" and sample size ",
df$n[1], "\nachieved for theta0 ",
sprintf("%.4f", theta0s[1]), " or ",
sprintf("%.4f", theta0s[2]), ".\n")
print(df, row.names = FALSE); cat(txt)
R> metric theta0 CV n power
R> Cmax 0.95 0.25 27 0.803494
R> AUC 0.95 0.20 18 0.808949
R> Target power for AUC and sample size 27
R> achieved for theta0 0.9164 or 1.0912.
That means, although we assumed for AUC the same T/Rratio as for C_{max} – with the sample size of 27 required for C_{max} – for AUC it can be as low as ~0.92 or as high as ~1.09, which is a soothing sideeffect.
Furthermore, sometimes we have less data of AUC than of C_{max} (samples at the end of the profile missing or unreliable estimation of \(\small{\lambda_\textrm{z}}\) in some subjects and therefore, less data of AUC_{0–∞} than of AUC_{o–t}). Again, it will not hurt because for the originally assumed T/Rratio we need only 18 subjects.
Since – as a onepoint metric – C_{max} is inherently more variable than AUC, Health Canada does not require assessment of its confidence interval – only the point estimate has to lie within 80.0–125.0%. We can explore that by setting alpha = 0.5
for it.^{25}
metrics < c("Cmax", "AUC")
CV < c(0.60, 0.20)
theta0 < 0.95
target < 0.80
alpha < c(0.50, 0.05)
df < data.frame(metric = metrics, CV = CV, theta0 = theta0,
alpha = alpha, n = NA)
for (i in 1:nrow(df)) {
df$n[i] < sampleN.TOST(alpha = alpha[i],
CV = CV[i],
theta0 = theta0,
targetpower = target,
design = design,
print = FALSE)[["Sample size"]]
}
txt < paste0("Sample size based on ",
df$metric[df$n == max(df$n)], ".\n")
print(df, row.names = FALSE); cat(txt)
R> metric CV theta0 alpha n
R> Cmax 0.6 0.95 0.50 24
R> AUC 0.2 0.95 0.05 18
R> Sample size based on Cmax.
Only with a relatively high CV (compared to the one of AUC) you may face a situation where you have to base the sample size on C_{max}.
First of all we have have to consider the purpose of the study and how we will evaluate it (see above).
The wellknown Bonferroniadjustment can be used \[\begin{equation}\tag{6} \alpha_\textrm{adj}=\alpha\,/\,k \end{equation}\] where \(\small{k}\) is the number of simultaneous tests. The type I error \(\small{TIE}\) will always be controlled because \[\begin{equation}\tag{7} TIE=1{(1\alpha_\textrm{adj})}^k \end{equation}\] For \(\small{\alpha=0.05}\) and \(\small{k=2}\) we get \(\small{0.049375<\alpha}\).
Of course, all comparisons in the study should be done by \(\small{100(12\,\alpha_\textrm{adj})}\) confidence intervals.
Extending the basic example and using the argument alpha
:
k < 2 # comparisons
alpha.adj < 0.05 / k # Bonferroni
sampleN.TOST(alpha = alpha.adj, CV = 0.25,
theta0 = 0.95, targetpower = 0.80,
design = design)
R>
R> +++++++++++ Equivalence test  TOST +++++++++++
R> Sample size estimation
R> 
R> Study design: 3x3 crossover
R> logtransformed data (multiplicative model)
R>
R> alpha = 0.025, target power = 0.8
R> BE margins = 0.8 ... 1.25
R> True ratio = 0.95, CV = 0.25
R>
R> Sample size (total)
R> n power
R> 36 0.827798
The sample size increased by ~33% from the 27 subjects required in a single comparison.
With an increasing number of comparisons, the Bonferroniadjustment quickly becomes overly conservative. Another issue is that the tests are not strictly independent (at least when we compare different tests to the same reference treatment some – unknown – correlation exists). More powerful methods (Holm, Hochberg, …) are out of scope of this article.
Three treatments intended for evaluation ‘All at Once’ or ‘Two at a Time’. Equivalence is seeked for both (say a tablet and a capsule vs a reference).
CV < 0.25
df < data.frame(design = c("3x3", "3x6x3",
"2x2x2"),
n = NA, power = NA)
for (i in 1:nrow(df)) {
df[i, 2:3] < sampleN.TOST(alpha = 0.05 / 2,
CV = CV,
design = df$design[i],
print = FALSE)[7:8]
}
df$power < signif(df$power, 4)
print(df, row.names = FALSE)
R> design n power
R> 3x3 36 0.8278
R> 3x6x3 36 0.8278
R> 2x2x2 36 0.8161
Four treatments (e.g., Test fasting and fed state, Reference fasting and fed state). Two comparisons are confirmatory (Test vs Reference fasting, Test vs Reference fed) and two are exploratory (Test fed vs fasting, Reference fed vs fasting). We have to adjust only for the former since the latter are only ‘nice to know’.
CV < 0.25
df < data.frame(design = c("4x4", "2x2x2"),
n = NA, power = NA)
for (i in 1:nrow(df)) {
df[i, 2:3] < sampleN.TOST(alpha = 0.05 / 2,
CV = CV,
design = df$design[i],
print = FALSE)[7:8]
}
df$power < signif(df$power, 4)
print(df, row.names = FALSE)
R> design n power
R> 4x4 36 0.8315
R> 2x2x2 36 0.8161
So far we employed the common (and hence, default) BElimits theta1 = 0.80
and theta2 = 1.25
. In some jurisdictions tighter limits of 90.00 – 111.11% have to be used.^{26}
Generally NTIDs show a low withinsubject variability (though the betweensubject CV can be much higher – these drugs require quite often dosetitration).
You have to provide only the lower BElimit theta1
(the upper one will be automatically calculated). In the examples I used a lower CV of 0.125.
R>
R> +++++++++++ Equivalence test  TOST +++++++++++
R> Sample size estimation
R> 
R> Study design: 3x3 crossover
R> logtransformed data (multiplicative model)
R>
R> alpha = 0.05, target power = 0.8
R> BE margins = 0.9 ... 1.111111
R> True ratio = 0.95, CV = 0.125
R>
R> Sample size (total)
R> n power
R> 69 0.813991
Not so nice. If you would exhaust the capacity of the clinical site, you may consider a replicate design.
Interlude 3
Health Canada requires for NTIDs (termed by HC ‘critical dose drugs’) that the confidence interval of C_{max} lies within 80.0–125.0% and the one of AUC within 90.0–112.0%.
<nitpick></nitpick>
Hence, for Health Canada you have to specify both limits.
R>
R> +++++++++++ Equivalence test  TOST +++++++++++
R> Sample size estimation
R> 
R> Study design: 3x3 crossover
R> logtransformed data (multiplicative model)
R>
R> alpha = 0.05, target power = 0.8
R> BE margins = 0.9 ... 1.12
R> True ratio = 0.95, CV = 0.125
R>
R> Sample size (total)
R> n power
R> 69 0.813991
However, only rarely a bit lower than with the common limits for NTIDs (here we get the same sample size).
The FDA requires for NTIDs tighter batchrelease specifications (±5% instead of ±10%). Let’s hope that your product complies and the T/Rratio will be closer to 1:
R>
R> +++++++++++ Equivalence test  TOST +++++++++++
R> Sample size estimation
R> 
R> Study design: 3x3 crossover
R> logtransformed data (multiplicative model)
R>
R> alpha = 0.05, target power = 0.8
R> BE margins = 0.9 ... 1.111111
R> True ratio = 0.975, CV = 0.125
R>
R> Sample size (total)
R> n power
R> 33 0.820725
Substantially lower sample and doable.
Sometimes we are interested in assessing differences of responses instead of their ratios. In such a case we have to set logscale = FALSE
. The limits theta1
and theta2
can be expressed in the following ways:
Note that in the former case the units of CV
, and theta0
need also to be given relative to the reference mean (specified as ratio).
Let’s estimate the sample size for an equivalence trial of two blood pressure lowering drugs assessing the difference in means of untransformed data (raw, linear scale). In this setup everything has to be given with the same units (i.e., here the assumed difference –5 mm Hg and the lower / upper limits –15 mm Hg / +15 mm Hg systolic blood pressure). Furthermore, we assume a CV of 25 mm Hg.
R>
R> +++++++++++ Equivalence test  TOST +++++++++++
R> Sample size estimation
R> 
R> Study design: 3x3 crossover
R> untransformed data (additive model)
R>
R> alpha = 0.05, target power = 0.8
R> BE margins = 15 ... 15
R> True diff. = 5, CV = 20
R>
R> Sample size (total)
R> n power
R> 51 0.805426
Sometimes in the literature we find not the CV but the standard deviation of the difference. Say, it is given with 36 mm Hg. We have to convert it to a CV. In all implemented HigherOrder Crossover designs it’s \(\small{SD/\sqrt{2}}\).
SD.delta < 36
sampleN.TOST(CV = SD.delta / sqrt(2), theta0 = 5,
theta1 = 15, theta2 = +15,
design = design,
logscale = FALSE)
R>
R> +++++++++++ Equivalence test  TOST +++++++++++
R> Sample size estimation
R> 
R> Study design: 3x3 crossover
R> untransformed data (additive model)
R>
R> alpha = 0.05, target power = 0.8
R> BE margins = 15 ... 15
R> True diff. = 5, CV = 25.45584
R>
R> Sample size (total)
R> n power
R> 81 0.800354
Note that other software packages (e.g., PASS, nQuery, StudySize, …) require the standard deviation of the difference as input.
It would be a fundamental flaw if responses are assumed to a follow a normal distribution and not – like above – assess their difference and at the end instead giving an estimate of the difference, calculate the ratio of the test and referencemeans.
In such a case Fieller’s (‘fiducial’) confidence interval^{27} has to be employed.
Note that in the functions sampleN.RatioF()
and power.RatioF()
alpha = 0.025
is the default, since it is intended for studies with clinical endpoints, where the 95% confidence interval is usually used for equivalence testing.^{28} Only the ‘Two at a Time’ approach is implemented, i.e., you have to specify design = "2x2"
. Sometimes we have to round the sample size up (see the interlude). Note that we need additionally the betweensubject CV (here arbitrarily twice the CV).
planned < "3x3"
CV < 0.25
CVb < 2 * CV
get.sequences < function(design) {
known < known.designs()[, c(2, 5)]
ns < as.integer(known[known$design == design,
"steps"])
return(ns)
}
make.equal < function(n, ns) {
return(as.integer(ns * (n %/% ns + as.logical(n %% ns))))
}
ns < get.sequences(planned)
n < sampleN.RatioF(CV = CV, CVb = CVb, design = "2x2",
print = FALSE)[["Sample size"]]
n < make.equal(n = n, ns = ns)
df < data.frame(Design = planned, Eval = "2x2",
alpha = 0.025, CV = CV, CVb = CVb, n = n)
df$power < suppressMessages(
power.RatioF(CV = CV, CVb = CVb,
design = "2x2", n = n))
names(df)[6:7] < c("Sample size", "Achieved power")
print(df, row.names = FALSE)
R> Design Eval alpha CV CVb Sample size Achieved power
R> 3x3 2x2 0.025 0.25 0.5 42 0.8021525
“He who seeks for methods without having a definite problem in mind seeks in the most part in vain.
With a few exceptions (i.e., simulationbased methods), in PowerTOST
the default method = "exact"
implements Owen’s Q function^{29} which is also used in SAS’ Proc Power
.
Other implemented methods are "mvt"
(based on the bivariate noncentral tdistribution), "noncentral"
/ "nct"
(noncentral tdistribution), and "shifted"
/ "central"
(shifted central tdistribution). Although "mvt"
is also exact, it may have a somewhat lower precision compared to Owen’s Q and has a much longer runtime.
Let’s compare them.
methods < c("exact", "mvt", "noncentral", "central")
df < data.frame(method = methods,
power = rep(NA, 4))
for (i in 1:nrow(df)) {
df$power[i] < power.TOST(CV = 0.25,
n = 27,
design = design,
method = df$method[i])
}
df$power < round(df$power, 5)
print(df, row.names = FALSE)
R> method power
R> exact 0.80349
R> mvt 0.80349
R> noncentral 0.80349
R> central 0.80113
Power approximated by the shifted central tdistribution is generally slightly lower compared to the others. Hence, if used in sample size estimations, occasionally one more complete sequence is ‘required’.
Therefore, I recommend to use "method = shifted"
/ "method = central"
only for comparing with old results (literature, own studies).
Q: Can we use R in a regulated environment and is PowerTOST
validated?
A: About the acceptability of Base R see ‘A Guidance Document for the Use of R in Regulated Clinical Trial Environments’.
The authors of PowerTOST
tried to do their best to provide reliable and valid results. The ‘NEWS’file on CRAN documents the development of the package, bugfixes, and introduction of new methods.
Validation of any software (yes, of SAS as well…) lies in the hands of the user.
Q: Shall we throw away our sample size tables?
A: Not at all. File them in your archives to collect dust. Maybe in the future you will be asked by an agency how you arrived at a sample size. But: Don’t use them any more. What you should not do (and hopefully haven’t done before): Interpolate. Power and therefore, the sample size depends in a highly nonlinear fashion on the five conditions listed above, which makes interpolation of values given in table a nontrivial job.
Q: Which of the methods should we use in our daily practice?
A: method = exact
. Full stop. Why rely on approximations? Since it is the default in sampleN.TOST()
and power.TOST()
, you don’t have to give this argument (saves keystrokes).
Q: I fail to understand your example about dropouts. We finish the study with 27 eligible subjects as desired. Why is the dropoutrate ~18% and not the anticipated 10%?
A: That’s due to rounding up the calculated adjusted sample size (31.76…) to the next complete sequence (33).
If you manage it to dose fractional subjects (I can’t) your dropout rate would indeed equal the anticipated one: 100(1 – 27/31.76…) = 10%. ⬜
Q: Do we have to worry about unbalanced sequences?
A: sampleN.TOST()
will always give the total number of subjects for balanced sequences.
If you are interested in post hoc power, give the sample size as a vector, i.e., power.TOST(..., n = c(foo, bar, ...)
, where foo
, bar
, ...
are the number of subjects in each sequence.
Q: Is it possible to simulate power of studies?
A: That’s not necessary, since the available methods provide analytical solutions. However, if you don’t trust them, simulations are possible with the function power.TOST.sim()
, which employs the distributional properties:
\(\small{\sigma^2}\) follows a \(\small{\chi^2}\)distribution with \(\small{n2}\) degrees of freedom and \(\small{\log_{e}(\theta_0)}\) follows a normal distribution).^{30}
Convergence takes a while. Empiric power, its standard error, its relative error compared to the exact power, and execution times on a Xeon E31245v3 3.4 GHz, 8 MB cache, 16 GB RAM, 64 bit R 4.1.0 on Windows 7:
CV < 0.25
des < "3x3"
n < sampleN.TOST(CV = CV, design = des,
print = FALSE)[["Sample size"]]
nsims < c(1e5, 5e5, 1e6, 5e6, 1e7, 5e7, 1e8)
exact < power.TOST(CV =CV, n = n, design = des)
df < data.frame(simulations = nsims,
exact = rep(exact, length(nsims)),
simulated = NA, SE = NA, RE = NA)
for (i in 1:nrow(df)) {
start.time < proc.time()[[3]]
df$simulated[i] < power.TOST.sim(CV = CV, n = n,
design = des,
nsims = nsims[i])
df$secs[i] < proc.time()[[3]]  start.time
df$SE[i] < sqrt(0.025 / i)
df$RE < 100 * (df$simulated  exact) / exact
}
df$exact < signif(df$exact, 5)
df$SE < signif(df$SE, 4)
df$RE < sprintf("%+.4f%%", df$RE)
df$simulated < signif(df$simulated, 5)
df$simulations < formatC(df$simulations, format = "f",
digits = 0, big.mark=",")
names(df)[c(1, 3)] < c("sim\u2019s", "sim\u2019d")
print(df, row.names = FALSE)
R> sim’s exact sim’d SE RE secs
R> 100,000 0.80349 0.80222 0.15810 0.1585% 0.12
R> 500,000 0.80349 0.80333 0.11180 0.0209% 0.59
R> 1,000,000 0.80349 0.80338 0.09129 0.0145% 1.30
R> 5,000,000 0.80349 0.80353 0.07906 +0.0040% 6.22
R> 10,000,000 0.80349 0.80342 0.07071 0.0094% 12.22
R> 50,000,000 0.80349 0.80353 0.06455 +0.0046% 60.35
R> 100,000,000 0.80349 0.80350 0.05976 +0.0007% 119.53
In the Monte Carlo method values approach the exact one asymptotically.
License
Helmut Schütz 2021
1^{st} version March 17, 2021.
Rendered 20210604 16:25:26 CEST by rmarkdown in 1.64 seconds.
Footnotes and References
Labes D, Schütz H, Lang B. PowerTOST: Power and Sample Size for (Bio)Equivalence Studies. 20210118. CRAN.↩︎
Labes D, Schütz H, Lang B. Package ‘PowerTOST’. January 18, 2021. CRAN.↩︎
If we plan all studies for 80% power, treatments are equivalent, and all assumptions are exactly realized, one out of five studies will fail by pure chance.
Science is a cruel mistress.↩︎
U.S. FDA, CDER. Guidance for Industry. Statistical Approaches Establishing Bioequivalence. January 2001. APPENDIX C.↩︎
According to the WHO:
»The a posteriori power of the study does not need to be calculated. The power of interest is that calculated before the study is conducted to ensure that the adequate sample size has been selected. […] The relevant power is the power to show equivalence within the predefined acceptance range.«↩︎
Fuglsang A. Pilot and Repeat Trials as Development Tools Associated with Demonstration of Bioequivalence. AAPS J. 2015; 17(3): 678–83. doi:10.1208/s1224801597446. Free Full Text.↩︎
Hoenig JM, Heisey DM. The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. Am Stat. 2001; 55(1): 19–24. doi:10.1198/000313001300339897.↩︎
There is no statistical method to ‘correct’ for unequal carryover. It can only be avoided by design, i.e., a sufficiently long washout between periods. According to the guidelines subjects with predose concentrations > 5% of their C_{max} can by excluded from the comparison if stated in the protocol. For details see another article.↩︎
Julious SA. Sample Sizes for Clinical Trials. Chapter 7.2. Boca Raton: CRC Press; 2010.↩︎
Common batch release specifications are ±10% of the declared content. Of course, you will try to find a batch of the reference product which is as close as possible to the test (the EMA recommends ≤5%). However, keep analytical (in)accuracy and (im)precision in mind. Furthermore, the analytical method was only validated for the test. Try to get a CoA from the originator… Good luck. You never can be sure. Only Health Canada allows a doseadjustment in the evaluation of the study.↩︎
Zhang P. A Simple Formula for Sample Size Calculation in Equivalence Studies. J Biopharm Stat. 2003; 13(3): 529–538. doi:10.1081/BIP120022772.↩︎
Let us assume that the – whilst unknown – within subject CV of the two reference treatments are different (CV_{wR1} ≠ CV_{wR1}). Since in the ‘All at Once’ approach the pooled variance is used, the confidence interval of the treatment comparisons may be wider than in the ‘Two at a Time’ approach.↩︎
Since the populations differ and the risk for both is still \(\small{\alpha}\) – apart for tourists buying their product in another region…↩︎
Whilst I used the former argument for many years, nowadays I’m leaning towards the latter.↩︎
EMA, CHMP. Guideline on the Investigation of Bioequivalence. CPMP/EWP/QWP/1401/98 Rev. 1/Corr **. London. 20 January 2010.↩︎
David Brown. Presentation at the 3^{rd} EGA Symposium on Bioequivalence. London. June 2010. slides.↩︎
European Generic Medicines Association. Revised EMA Bioequivalence Guideline. Questions & Answers. download.↩︎
D’Angelo P. Testing for Bioequivalence in Higher‐Order Crossover Designs: Two‐at‐a‐Time Principle Versus Pooled ANOVA. 2^{nd} Conference of the Global Bioequivalence Harmonisation Initiative. Rockville, MD. 15–16 September, 2016.↩︎
Schütz H. Sample Size Estimation in Bioequivalence. Evaluation. BEBA Forum. 20201023.↩︎
Quoting my late father: »If you believe, go to church.«↩︎
Ring A, Lang B, Kazaroho C, Labes D, Schall R, Schütz H. Sample size determination in bioequivalence studies using statistical assurance. Br J Clin Pharmacol. 2019; 85(10): 2369–77. doi:10.1111/bcp.14055. Open Access.↩︎
Berger RL, Hsu JC. Bioequivalence Trials, IntersectionUnion Tests and Equivalence Confidence Sets. Stat Sci. 1996; 11(4): 283–302. JSTOR:2246021.↩︎
Zeng A. The TOST confidence intervals and the coverage probabilities with R simulation. March 14, 2014.↩︎
With alpha = 0.5
the test effectively reduces to a pass/fail assessment.↩︎
The FDA recommends ReferenceScaled Average Bioequivalence (RSABE) and a comparison of the withinsubject variances of treatments, which requires a fully replicated design. It will be covered in another article.↩︎
Fieller EC. Some Problems In Interval Estimation. J Royal Stat Soc B. 1954; 16(2): 175–85. JSTOR:2984043.↩︎
EMEA, CPMP. Points to Consider on Switching between Superiority and NonInferiority. CPMP/EWP/482/99. London. 27 July 2000.↩︎
Owen DB. A special case of a bivariate noncentral tdistribution. Biometrika. 1965; 52(3/4): 437–46. doi:10.2307/2333696.↩︎
Zheng C, Wang J, Zhao L. Testing bioequivalence for multiple formulations with power and sample size calculations. Pharm Stat. 2012; 11(4): 334–41. doi:10.1002/pst.1522.↩︎