Consider allowing JavaScript. Otherwise, you have to be proficient in reading since formulas will not be rendered. Furthermore, the table of contents in the left column for navigation will not be available and codefolding not supported. Sorry for the inconvenience.
Examples in this article were generated with
4.1.3
by the package PowerTOST
.^{1}
More examples are given in the respective vignette.^{2} See also the README on GitHub for an overview and the online manual^{3} for details.
Abbreviation  Meaning 

\(\small{\alpha}\)  Nominal level of the test, probability of Type I Error (patient’s risk) 
(A)BE  (Average) Bioequivalence 
ABEL  Average Bioequivalence with Expanding Limits 
\(\small{\beta}\)  Probability of Type II Error (producer’s risk), where \(\small{\beta=1\pi}\) 
\(\small{CV_\textrm{b}}\)  Betweensubject Coefficient of Variation 
\(\small{CV_\textrm{w}}\)  (Pooled) withinsubject Coefficient of Variation 
\(\small{CV_\textrm{wR},\;CV_\textrm{wT}}\)  Observed withinsubject Coefficient of Variation of the Reference and Test product 
\(\small{H_0}\)  Null hypothesis (inequivalence) 
\(\small{H_1}\)  Alternative hypothesis (equivalence) 
\(\small{\mu_\textrm{T}/\mu_\textrm{R}}\)  True T/Rratio 
HVD(P)  Highly Variable Drug (Product) 
\(\small{\pi}\)  (Prospective) power, where \(\small{\pi=1\beta}\) 
PE  Point Estimate 
R  Reference product 
RSABE  Referencescaled Average Bioequivalence 
SABE  Scaled Average Bioequivalence 
\(\small{s_\textrm{bR}^2,\;s_\textrm{bT}^2}\)  Observed betweensubject variance of the Reference and Test product 
\(\small{s_\textrm{wR},\;s_\textrm{wT}}\)  Observed withinsubject standard deviation of the Reference and Test product 
\(\small{s_\textrm{wR}^2,\;s_\textrm{wT}^2}\)  Observed withinsubject variance of the Reference and Test product 
\(\small{\sigma_\textrm{wR}}\)  True withinsubject standard deviation of the Reference product 
T  Test product 
\(\small{\theta_0}\)  True T/Rratio 
\(\small{\theta_1,\;\theta_2}\)  Fixed lower and upper limits of the acceptance range (generally 80.00 – 125.00%) 
\(\theta_\textrm{s}\)  Regulatory constant (≈0.8925742…) 
\(\small{\theta_{\textrm{s}_1},\;\theta_{\textrm{s}_2}}\)  Scaled lower and upper limits of the acceptance range 
TIE  Type I Error (patient’s risk) 
TIIE  Type II Error (producer’s risk: 1 – power) 
What is Referencescaled Average Bioequivalence?
For details about inferential statistics and hypotheses in equivalence see another article.
Definitions:
The concept of Scaled Average Bioequivalence (SABE) for HVD(P)s is based on the following considerations:
The conventional confidence interval inclusion approach in ABE \[\begin{matrix}\tag{1} \theta_1=1\Delta,\theta_2=\left(1\Delta\right)^{1}\\ H_0:\;\frac{\mu_\textrm{T}}{\mu_\textrm{R}}\ni\left\{\theta_1,\,\theta_2\right\}\;vs\;H_1:\;\theta_1<\frac{\mu_\textrm{T}}{\mu_\textrm{R}}<\theta_2, \end{matrix}\] where \(\small{H_0}\) is the null hypothesis of inequivalence and \(\small{H_1}\) the alternative hypothesis, \(\small{\theta_1}\) and \(\small{\theta_2}\) are the fixed lower and upper limits of the acceptance range, and \(\small{\mu_\textrm{T}}\) are the geometric least squares means of \(\small{\textrm{T}}\) and \(\small{\textrm{R}}\), respectively is in Scaled Average Bioequivalence (SABE)^{5} modified to \[H_0:\;\frac{\mu_\textrm{T}}{\mu_\textrm{R}}\Big{/}\sigma_\textrm{wR}\ni\left\{\theta_{\textrm{s}_1},\,\theta_{\textrm{s}_2}\right\}\;vs\;H_1:\;\theta_{\textrm{s}_1}<\frac{\mu_\textrm{T}}{\mu_\textrm{R}}\Big{/}\sigma_\textrm{wR}<\theta_{\textrm{s}_2},\tag{2}\] where \(\small{\sigma_\textrm{wR}}\) is the standard deviation of the reference and the scaled limits \(\small{\left\{\theta_{\textrm{s}_1},\,\theta_{\textrm{s}_2}\right\}}\) of the acceptance range depend on conditions given by the agency.
ReferenceScaled Average Bioequivalence (RSABE) for HVD(P)s is recommended by the FDA^{6} and China’s CDE.^{7}
Alas, we are far away from global harmonization. Average Bioequivalence with Expanding Limits (ABEL) is another variant of SABE and recommended in all other jurisdictions accepting SABE. For details see another article.
In order to apply RSABE following conditions have to be fulfilled:
The approach given in the guidances^{6 7} is a decision scheme which hinges on the estimated standard deviation of the reference treatment \(\small{s_{\textrm{wR}}}\). If \(\small{s_\textrm{wR}<0.294}\) the study has to be assessed for ABE (left branch) and for RSABE (right branch) otherwise. In the RSABEbranch the point estimate (\(\small{PE}\)) has to lie within 80.00 – 125.00%.
Based on the switching standard deviation \(\small{s_0=0.25}\) we get the regulatory constant \(\small{\theta_\textrm{s}=\frac{\log_{e}1.25}{s_0}=0.8925742\ldots}\) and finally the ‘implied limits’ \(\small{\left\{\theta_{\textrm{s}_1},\theta_{\textrm{s}_2}\right\}=100\left(\exp(\mp0.8925742\cdot s_{\textrm{wR}})\right)}\).^{10}
Note that scaling starts at the switching standard deviation \(\small{s_0=0.25}\) (\(\small{CV_0\approx25.396\%}\)) but must be only applied if \(\small{s_{\textrm{wR}}\geq0.294}\) (\(\small{CV_{\textrm{wR}}\geq\approx0.30047}\)), thus explaining the dichotomy^{11} at this value.
Note also that – contrary to ABEL – there is no upper cap of scaling. However, for very large variability the decision is mainly lead by the \(\small{PE}\)constraint.
Since the applicability of this approach depends on the realized values (\(\small{s_\textrm{wR}}\), \(\small{PE}\)) in the particular study – which are naturally unknown beforehand – analytical solutions for power (and hence, the sample size) do not exist.
Therefore, extensive simulations of potential combinations have to be employed.
Cave: If \(\small{s_{\textrm{wR}}<0.294}\) the method may lead to an inflated Type I Error (increased patient’s risk). It is elaborated in another article.
A basic knowledge of R is
required. To run the scripts at least version 1.4.8 (20190829) of
PowerTOST
is required and 1.5.3 (20210118)
suggested.
Any version of R would likely do, though the
current release of PowerTOST
was only tested with version
4.0.5 (20210331) and later. All scripts were run on a Xeon E31245v3 @
3.40GHz (1/4 cores) 16GB RAM with 64 bit R
4.1.3 on Windows 7.
Note that in all functions of PowerTOST
the arguments
(say, the assumed T/Rratio theta0
, the assumed coefficient
of variation CV
, etc.) have to be given as ratios and not
in percent.
sampleN.RSABE()
gives balanced sequences (i.e.,
an equal number of subjects is allocated to all sequences). Furthermore,
the estimated sample size is the total number of subjects.
All examples deal with studies where the response variables likely follow a lognormal distribution, i.e., we assume a multiplicative model (ratios instead of differences). We work with \(\small{\log_{e}}\)transformed data in order to allow analysis by the ttest (requiring differences).
It may sound picky but ‘sample size calculation’ (as used in most guidelines and alas, in some publications and textbooks) is sloppy terminology. In order to get prospective power (and hence, a sample size), we need five values:
where
In other words, obtaining a sample size is not an exact calculation like \(\small{2\times2=4}\) but always just an estimation.
“Power Calculation – A guess masquerading as mathematics.
Of note, it is extremely unlikely that all assumptions will be exactly realized in a particular study. Hence, calculating retrospective (a.k.a. post hoc, a posteriori) power is not only futile but plain nonsense.^{14}
Since generally the withinsubject variability \(\small{CV_\textrm{w}}\) is smaller than the betweensubject variability \(\small{CV_\textrm{b}}\), crossover studies are so popular. Of note, there is no relationship between \(\small{CV_\textrm{w}}\) and \(\small{CV_\textrm{b}}\). An example are drugs which are subjected to polymorphic metabolism. For these drugs \(\small{CV_\textrm{w}\ll CV_\textrm{b}}\).
Furthermore, the drugs’ withinsubject variability may be unequal (i.e., \(\small{s_{\textrm{wT}}^{2}\neq s_{\textrm{wR}}^{2}}\)). For details see the section Heteroscedasticity.
It is a prerequisite that no carryover from one period to the next exists. Only then the comparison of treatments will be unbiased.^{15} Carryover is elaborated in another article.
For neRs only!The sample size cannot be
directly estimated,
in SABE only power
simulated for an already given sample size.
“Power. That which statisticians are always calculating but never have.
Let’s start with PowerTOST
.
library(PowerTOST) # attach it to run the examples
The sample size functions’ defaults are:
Argument  Default  Meaning 

alpha

0.05

Nominal level of the test. 
targetpower

0.80

Target (desired) power. 
theta0

0.90

Assumed T/Rratio. 
theta1

0.80

Lower BElimit in ABE and lower PEconstraint in RSABE. 
theta2

1.25

Upper BElimit in ABE and upper PEconstraint in RSABE. 
design

"2x3x3"

Treatments × Sequences × Periods. 
regulator

"FDA"

Guess… 
nsims

1e05

Number of simulations. 
print

TRUE

Output to the console. 
details

TRUE

Show regulatory settings and sample size search. 
setseed

TRUE

Set a fixed seed (recommended for reproducibility). 
For a quick overview of the ‘implied limits’ use the function
scABEL()
– for once in percent according to the
guidance.
< data.frame(CV = 100*c(0.25, se2CV(0.25), 0.3, se2CV(0.294),
df 0.5, 0.65),
swR = NA,
method = c("ABE", rep("SABE", 5)),
applicable = c(rep("ABE", 3), rep("RSABE", 3)),
cap = c(rep("lower", 3), rep("  ", 3)),
L = NA, U = NA)
for (i in 1:6) {
$swR[i] < signif(CV2se(df$CV[i]/100), 5)
df6:7] < sprintf("%.2f%%", 100*scABEL(df$CV[i]/100,
df[i, regulator = "FDA"))
}$CV < sprintf("%.3f%%", df$CV)
dfprint(df, row.names = FALSE)
# CV swR method applicable cap L U
# 25.000% 0.24622 ABE ABE lower 80.00% 125.00%
# 25.396% 0.25000 SABE ABE lower 80.00% 125.00%
# 30.000% 0.29356 SABE ABE lower 80.00% 125.00%
# 30.047% 0.29400 SABE RSABE  76.92% 130.01%
# 50.000% 0.47238 SABE RSABE  65.60% 152.45%
# 65.000% 0.59365 SABE RSABE  58.87% 169.87%
The sample size functions of PowerTOST
use a
modification of Zhang’s method^{18} based on the large sample approximation
as the starting value of the iterations.
# Note that theta0 = 0.90 is the default
sampleN.RSABE(CV = 0.45, targetpower = 0.80,
design = "2x2x4", details = TRUE)
#
# ++++++++ Reference scaled ABE crit. +++++++++
# Sample size estimation
# 
# Study design: 2x2x4 (4 period full replicate)
# logtransformed data (multiplicative model)
# 1e+05 studies for each step simulated.
#
# alpha = 0.05, target power = 0.8
# CVw(T) = 0.45; CVw(R) = 0.45
# True ratio = 0.9
# ABE limits / PE constraints = 0.8 ... 1.25
# FDA regulatory settings
#  CVswitch = 0.3
#  regulatory constant = 0.8925742
#  pe constraint applied
#
#
# Sample size search
# n power
# 18 0.71411
# 20 0.75794
# 22 0.79514
# 24 0.82450
Throughout the examples I’m referring to studies in a single center – not multiple groups within them or multicenter studies. That’s another pot of tea.
We assume a CV of 0.45, a T/Rratio of 0.90, a target a power of 0.80, and want to perform the study in a twosequence fourperiod full replicate study (TRTRRTRT or TRRTRTTR or TTRRRRTT).
Since theta0 = 0.90
,^{19} targetpower = 0.80
, and
regulator = "FDA"
are defaults of the function, we don’t
have to give them explicitely. As usual in bioequivalence,
alpha = 0.05
is employed (we will assess the study by a
\(\small{100\,(12\,\alpha)=90\%}\)
confidence interval). Hence, you need to specify only the
CV
(assuming \(\small{CV_\textrm{wT}=CV_\textrm{wR}}\))
and design = "2x2x4"
.
To shorten the output, use the argument
details = FALSE
.
sampleN.RSABE(CV = 0.45, design = "2x2x4", details = FALSE)
#
# ++++++++ Reference scaled ABE crit. +++++++++
# Sample size estimation
# 
# Study design: 2x2x4 (4 period full replicate)
# logtransformed data (multiplicative model)
# 1e+05 studies for each step simulated.
#
# alpha = 0.05, target power = 0.8
# CVw(T) = 0.45; CVw(R) = 0.45
# True ratio = 0.9
# ABE limits / PE constraints = 0.8 ... 1.25
# Regulatory settings: FDA
#
# Sample size
# n power
# 24 0.82450
Sometimes we are not interested in the entire output and want to use
only a part of the results in subsequent calculations. We can suppress
the output by stating the additional argument print = FALSE
and assign the result to a data frame (here x
).
< sampleN.RSABE(CV = 0.45, design = "2x2x4",
x details = FALSE, print = FALSE)
Let’s retrieve the column names of x
:
names(x)
# [1] "Design" "alpha" "CVwT"
# [4] "CVwR" "theta0" "theta1"
# [7] "theta2" "Sample size" "Achieved power"
# [10] "Target power" "nlast"
Now we can access the elements of x
by their names. Note
that double square brackets [[…]]
have to be used.
"Sample size"]]
x[[# [1] 24
"Achieved power"]]
x[[# [1] 0.8245
Although you could access the elements by the number of the
column(s), I don’t recommend that, since in various other functions of
PowerTOST
these numbers are different and hence, difficult
to remember. If you insist in accessing elements by columnnumbers, use
single square brackets […]
.
8:9]
x[# Sample size Achieved power
# 1 24 0.8245
With 24 subjects (twelve per sequence) we achieve the power we desire.
What happens if we have one dropout?
power.RSABE(CV = 0.45, design = "2x2x4",
n = x[["Sample size"]]  1)
# Unbalanced design. n(i)=12/11 assumed.
# [1] 0.8105
Still above 0.80 we desire. However, with two dropouts (22 eligible subjects) we would slightly fall short (0.7951).
Since dropouts are common, it makes sense to include / dose more subjects in order to end up with a number of eligible subjects which is not lower than our initial estimate.
Let us explore that in the next section.
We define two supportive functions:
n
will be rounded up to achieve balance.< function(n, n.seq) {
balance return(as.integer(n.seq * (n %/% n.seq + as.logical(n %% n.seq))))
}
n
and the anticipated droputrate
do.rate
.< function(n, do.rate, n.seq) {
nadj return(as.integer(balance(n / (1  do.rate), n.seq)))
}
In order to come up with a suggestion we have to anticipate a (realistic!) dropout rate. Note that this not the job of the statistician; ask the Principal Investigator.
“It is a capital mistake to theorise before one has data.
The dropoutrate is calculated from the eligible and
dosed subjects
or simply \[\begin{equation}\tag{3}
do.rate=1n_\textrm{eligible}/n_\textrm{dosed}
\end{equation}\] Of course, we know it only after the
study was performed.
By substituting \(n_\textrm{eligible}\) with the estimated sample size \(n\), providing an anticipated dropoutrate and rearrangement to find the adjusted number of dosed subjects \(n_\textrm{adj}\) we should use \[\begin{equation}\tag{4} n_\textrm{adj}=\;\upharpoonleft n\,/\,(1do.rate) \end{equation}\] where \(\upharpoonleft\) denotes rounding up to the complete sequence as implemented in the functions above.
An all too common mistake is to increase the estimated sample size \(n\) by the dropoutrate according to \[\begin{equation}\tag{5} n_\textrm{adj}=\;\upharpoonleft n\times(1+do.rate) \end{equation}\] If you used \(\small{(5)}\) in the past – you are not alone. In a small survey a whopping 29% of respondents reported to use it.^{21} Consider changing your routine.
“There are no routine statistical questions, only questionable statistical routines.
In the following I specified more arguments to make the function more
flexible.
Note that I wrapped the function power.RSABE()
in
suppressMessages()
. Otherwise, the function will throw for
any odd sample size a message telling us that the design is
unbalanced. Well, we know that.
< 0.45 # withinsubject CV
CV < 0.80 # target (desired) power
target < 0.90 # assumed T/Rratio
theta0 < "2x2x4"
design < 0.15 # anticipated dropoutrate 15%
do.rate # might be relatively high
# due to the 4 periods
< as.integer(substr(design, 3, 3))
n.seq < scABEL(CV) # expanded limits
lims < sampleN.RSABE(CV = CV, theta0 = theta0,
df targetpower = target,
design = design,
details = FALSE,
print = FALSE)
# calculate the adjusted sample size
< nadj(df[["Sample size"]], do.rate, n.seq)
n.adj # (decreasing) vector of eligible subjects
< n.adj:df[["Sample size"]]
n.elig < paste0("Assumed CV : ",
info
CV,"\nAssumed T/R ratio : ",
theta0,"\nExpanded limits : ",
sprintf("%.4f\u2026%.4f",
1], lims[2]),
lims["\nPE constraints : ",
sprintf("%.4f\u2026%.4f",
0.80, 1.25), # fixed in ABEL
"\nTarget (desired) power : ",
target,"\nAnticipated dropoutrate: ",
do.rate,"\nEstimated sample size : ",
"Sample size"]], " (",
df[["Sample size"]]/n.seq, "/sequence)",
df[["\nAchieved power : ",
signif(df[["Achieved power"]], 4),
"\nAdjusted sample size : ",
" (", n.adj/n.seq, "/sequence)",
n.adj, "\n\n")
# explore the potential outcome for
# an increasing number of dropouts
< signif((n.adj  n.elig) / n.adj, 4)
do.act < data.frame(dosed = n.adj,
df eligible = n.elig,
dropouts = n.adj  n.elig,
do.act = do.act,
power = NA)
for (i in 1:nrow(df)) {
$power[i] < suppressMessages(
dfpower.RSABE(CV = CV,
theta0 = theta0,
design = design,
n = df$eligible[i]))
}cat(info); print(round(df, 4), row.names = FALSE)
# Assumed CV : 0.45
# Assumed T/R ratio : 0.9
# Expanded limits : 0.7215…1.3859
# PE constraints : 0.8000…1.2500
# Target (desired) power : 0.8
# Anticipated dropoutrate: 0.15
# Estimated sample size : 24 (12/sequence)
# Achieved power : 0.8245
# Adjusted sample size : 30 (15/sequence)
# dosed eligible dropouts do.act power
# 30 30 0 0.0000 0.8899
# 30 29 1 0.0333 0.8810
# 30 28 2 0.0667 0.8729
# 30 27 3 0.1000 0.8612
# 30 26 4 0.1333 0.8508
# 30 25 5 0.1667 0.8377
# 30 24 6 0.2000 0.8245
In the worst case (six dropouts) we end up with the originally estimated sample size of 24. Power preserved, mission accomplished. If we have less dropouts, splendid – we gain power.
If we would have adjusted the sample size according to \(\small{(5)}\) we
would have dosed 28 subjects.
If the anticipated dropout rate of 15% is realized in the study, we
would have 23 eligible subjects (power 0.8105). In this example we
achieve still more than our target power but the loss might be relevant
in other cases.
As said in the preliminaries, calculating post hoc power is futile.
“There is simple intuition behind results like these: If my car made it to the top of the hill, then it is powerful enough to climb that hill; if it didn’t, then it obviously isn’t powerful enough. Retrospective power is an obvious answer to a rather uninteresting question. A more meaningful question is to ask whether the car is powerful enough to climb a particular hill never climbed before; or whether a different car can climb that new hill. Such questions are prospective, not retrospective.
However, sometimes we are interested in it for planning the next study.
If you give and odd total sample size n
,
power.RSABE()
will try to keep sequences as balanced as
possible and show in a message how that was done.
< 25
n.act signif(power.RSABE(CV = 0.45, n = n.act,
design = "2x2x4"), 6)
# Unbalanced design. n(i)=13/12 assumed.
# [1] 0.83767
Say, our study was more unbalanced. Let us assume that we dosed 30
subjects, the total number of subjects was also 25 but all dropouts
occured in one sequence (unlikely but possible).
Instead of the total sample size n
we can give the number
of subjects of each sequence as a vector (the order is generally^{23} not
relevant, i.e., it does not matter which element refers to
which sequence).
By setting details = TRUE
we can retrieve the components
of the simulations (probability to pass each test).
< "2x2x4"
design < 0.45
CV < 30
n.adj < 25
n.act < n.adj / 2
n.s1 < n.act  n.s1
n.s2 < 0.90
theta0 < suppressMessages(
post.hoc power.RSABE(CV = CV,
n = c(n.s1, n.s2),
theta0 = theta0,
design = design,
details = TRUE))
< power.TOST(CV = CV,
ABE.xact n = c(n.s1, n.s2),
theta0 = theta0,
design = design)
< nchar(as.character(n.adj))
sig.dig < paste0("%", sig.dig, ".0f (%",
fmt ".0f dropouts)")
sig.dig, cat(paste0("Dosed subjects: ", sprintf("%2.0f", n.adj),
"\nEligible : ",
sprintf(fmt, n.act, n.adj  n.act),
"\n Sequence 1 : ",
sprintf(fmt, n.s1, n.adj / 2  n.s1),
"\n Sequence 1 : ",
sprintf(fmt, n.s2, n.adj / 2  n.s2),
"\nPower overall : ",
sprintf("%.5f", post.hoc[1]),
"\n p(SABE) : ",
sprintf("%.5f", post.hoc[2]),
"\n p(PE) : ",
sprintf("%.5f", post.hoc[3]),
"\n p(ABE) : ",
sprintf("%.5f", post.hoc[4]),
"\n p(ABE) exact: ",
sprintf("%.5f", ABE.xact), "\n"))
# Dosed subjects: 30
# Eligible : 25 ( 5 dropouts)
# Sequence 1 : 15 ( 0 dropouts)
# Sequence 1 : 10 ( 5 dropouts)
# Power overall : 0.82821
# p(SABE) : 0.84762
# p(PE) : 0.91031
# p(ABE) : 0.34574
# p(ABE) exact: 0.35740
The components of overall power are:
p(SABE)
is the probability that the study passed the
scaled criterion.p(PE)
is the probability that the point estimate is
within 80.00 – 125.00%.p(ABE)
is the probability of passing conventional
Average Bioequivalence.power.TOST()
– confirming the simulation’s result.Of course, in a particular study you will provide the numbers in the
n
vector directly.
The CV and the T/Rratio are only assumptions.
Whatever their origin might be (literature, previous studies) they carry
some degree of uncertainty. Hence, believing^{24} that they are the
true ones may be risky.
Some statisticians call that the ‘CarvedinStone’ approach.
Say, we performed a pilot study in 16 subjects and estimated the CV as 0.45.
The \(\small{\alpha}\) confidence interval of the CV is obtained via the \(\small{\chi^2}\)distribution of its error variance \(\small{\sigma^2}\) with \(\small{n2}\) degrees of freedom. \[\begin{matrix}\tag{6} s^2=\log_{e}(CV^2+1)\\ L=\frac{(n1)\,s^2}{\chi_{\alpha/2,\,n2}^{2}}\leq\sigma^2\leq\frac{(n1)\,s^2}{\chi_{1\alpha/2,\,n2}^{2}}=U\\ \left\{lower\;CL,\;upper\;CL\right\}=\left\{\sqrt{\exp(L)1},\sqrt{\exp(U)1}\right\} \end{matrix}\]
Let’s calculate the 95% confidence interval of the CV to get an idea.
< 16 # pilot study
m < CVCL(CV = 0.45, df = m  2,
ci side = "2sided", alpha = 0.05)
signif(ci, 4)
# lower CL upper CL
# 0.3223 0.7629
Surprised? Although 0.45 is the best estimate for planning the next study, there is no guarantee that we will get exactly the same outcome. Since the \(\small{\chi^2}\)distribution is skewed to the right, it is more likely that we will face a higher CV than a lower one in the planned study.
If we plan the study based on 0.45, we would opt for 24 subjects like
in the examples before (not adjusted for the dropoutrate).
If the CV will be lower, we loose power (less expansion). But
what if it will be higher? Depends. Since we may scale more, we gain
power. However, at large values the point estimate constraint cuts in
and we will loose power. But how much?
Let’s explore what might happen at the confidence limits of the CV.
< 16
m < CVCL(CV = 0.45, df = m  2,
ci side = "2sided", alpha = 0.05)
< 28
n < data.frame(CV = c(ci[["lower CL"]], 0.45,
comp "upper CL"]]),
ci[[power = NA)
for (i in 1:nrow(comp)) {
$power[i] < power.RSABE(CV = comp$CV[i],
compdesign = "2x2x4",
n = n)
}1] < signif(comp[, 1], 4)
comp[, 2] < signif(comp[, 2], 6)
comp[, print(comp, row.names = FALSE)
# CV power
# 0.3223 0.79246
# 0.4500 0.87287
# 0.7629 0.81122
Might hurt.
What can we do? The larger the previous study was, the larger the degrees of freedom and hence, the narrower the confidence interval of the CV. In simple terms: The estimate is more certain. On the other hand, it also means that very small pilot studies are practically useless. What happens when we plan the study based on the confidence interval of the CV?
< seq(12, 30, 6)
m < data.frame(n.pilot = m, CV = 0.45,
df l = NA, u = NA,
n.low = NA, n.CV = NA, n.hi = NA)
for (i in 1:nrow(df)) {
3:4] < CVCL(CV = 0.45, df = m[i]  2,
df[i, side = "2sided",
alpha = 0.05)
5] < sampleN.RSABE(CV = df$l[i], design = "2x2x4",
df[i, details = FALSE,
print = FALSE)[["Sample size"]]
6] < sampleN.RSABE(CV = 0.45, design = "2x2x4",
df[i, details = FALSE,
print = FALSE)[["Sample size"]]
7] < sampleN.RSABE(CV = df$u[i], design = "2x2x4",
df[i, details = FALSE,
print = FALSE)[["Sample size"]]
}
3:4] < signif(df[, 3:4], 4)
df[, names(df)[3:4] < c("lower CL", "upper CL")
print(df, row.names = FALSE)
# n.pilot CV lower CL upper CL n.low n.CV n.hi
# 12 0.45 0.3069 0.8744 30 24 32
# 18 0.45 0.3282 0.7300 30 24 26
# 24 0.45 0.3415 0.6685 28 24 24
# 30 0.45 0.3509 0.6334 28 24 24
Small pilot studies are practically useless. One leading generic company has an internal rule to perform pilot studies of HVD(P)s in a fourperiod full replicate design and at least 24 subjects. Makes sense.
Furthermore, we don’t know where the true T/Rratio lies but we can calculate the lower 95% confidence limit of the pilot study’s point estimate to get an idea about a worst case. Say, it was 0.90.
< 16
m < 0.45
CV < 0.90
pe < round(CI.BE(CV = CV, pe = 0.90, n = m,
ci design = "2x2x4"), 4)
if (pe <= 1) {
< ci[["lower"]]
cl else {
} < ci[["upper"]]
cl
}print(cl)
# [1] 0.7515
Exlore the impact of a relatively 5% lower CV (less expansion) and a relatively 5% lower T/Rratio on power for the given sample size.
< 28
n < 0.45
CV < 0.90
theta0 < data.frame(CV = c(CV, CV*0.95),
comp1 power = NA)
< data.frame(theta0 = c(theta0, theta0*0.95),
comp2 power = NA)
for (i in 1:2) {
$power[i] < power.RSABE(CV = comp1$CV[i],
comp1theta0 = theta0,
design = "2x2x4",
n = n)
}$power < signif(comp1$power, 5)
comp1for (i in 1:2) {
$power[i] < power.RSABE(CV = CV,
comp2theta0 = comp2$theta0[i],
design = "2x2x4",
n = n)
}$power < signif(comp2$power, 5)
comp2print(comp1, row.names = FALSE)
print(comp2, row.names = FALSE)
# CV power
# 0.4500 0.87287
# 0.4275 0.86914
# theta0 power
# 0.900 0.87287
# 0.855 0.71029
Interlude
Note the logscale of the xaxis. It demonstrates that power curves are symmetrical around 1 (\(\small{\log_{e}(1)=0}\), where \(\small{\log_{e}(\theta_2)=\left\log_{e}(\theta_1)\right}\)) and we will achieve the same power for \(\small{\theta_0}\) and \(\small{1/\theta_0}\) (e.g., for 0.90 and 1.1111). Contrary to ABE, power is maintained.
<nitpick>< 0.45
CV < 0.10 # direction unknown
delta < "2x2x4"
design < c(1  delta, 1 / (1 + delta),
theta0s 1 + delta, 1 / (1  delta))
< sampleN.RSABE(CV = CV, theta0 = 1  delta,
n design = design,
details = FALSE,
print = FALSE)[["Sample size"]]
< data.frame(CV = CV, theta0 = theta0s,
comp1 base = c(TRUE, rep(FALSE, 3)),
n = n, power = NA)
for (i in 1:nrow(comp1)) {
$power[i] < power.RSABE(CV = CV,
comp1theta0 = comp1$theta0[i],
design = design, n = n)
}< sampleN.RSABE(CV = CV, theta0 = 1 + delta,
n design = design,
details = FALSE,
print = FALSE)[["Sample size"]]
< data.frame(CV = CV, theta0 = theta0s,
comp2 base = c(FALSE, FALSE, TRUE, FALSE),
n = n, power = NA)
for (i in 1:nrow(comp2)) {
$power[i] < power.RSABE(CV = CV,
comp2theta0 = comp2$theta0[i],
design = design, n = n)
}c(2, 5)] < signif(comp1[, c(2, 5)] , 4)
comp1[, c(2, 5)] < signif(comp2[, c(2, 5)] , 4)
comp2[, print(comp1, row.names = FALSE)
print(comp2, row.names = FALSE)
# CV theta0 base n power
# 0.45 0.9000 TRUE 24 0.8245
# 0.45 0.9091 FALSE 24 0.8492
# 0.45 1.1000 FALSE 24 0.8499
# 0.45 1.1110 FALSE 24 0.8242
# CV theta0 base n power
# 0.45 0.9000 FALSE 22 0.7951
# 0.45 0.9091 FALSE 22 0.8199
# 0.45 1.1000 TRUE 22 0.8212
# 0.45 1.1110 FALSE 22 0.7954
</nitpick>
Essentially this leads to the murky waters of prospective sensitivity
analyses, which is covered in another article.
An appetizer to show the maximum deviations (CV, T/Rratio and
decreased sample size due to dropouts) which give still a minimum
acceptable power of ≥ 0.70:
< 0.45
CV < 0.90
theta0 < 0.80
target < 0.70
minpower < pa.scABE(CV = CV, theta0 = theta0,
pa targetpower = target,
minpower = minpower,
regulator = "FDA",
design = "2x2x4")
< 100*(tail(pa$paCV[["CV"]], 1) 
change.CV $plan[["CVwR"]]) /
pa$plan[["CVwR"]]
pa< 100*(head(pa$paGMR$theta0, 1) 
change.theta0 $plan$theta0) /
pa$plan[["theta0"]]
pa< 100*(tail(pa$paN[["N"]], 1) 
change.n $plan[["Sample size"]]) /
pa$plan[["Sample size"]]
pa< data.frame(parameter = c("CV", "theta0", "n"),
comp change = c(change.CV,
change.theta0,
change.n))$change < sprintf("%+.2f%%", comp$change)
compnames(comp)[2] < "relative change"
print(pa, plotit = FALSE)
print(comp, row.names = FALSE)
# Sample size plan scABE (FDA/RSABE)
# Design alpha CVwT CVwR theta0 theta1 theta2 Sample size
# 2x2x4 0.05 0.45 0.45 0.9 0.8 1.25 24
# Achieved power Target power
# 0.8245 0.8
#
# Power analysis
# CV, theta0 and number of subjects leading to min. acceptable power of ~0.7:
# CV= 1.1132, theta0= 0.8647
# n = 18 (power= 0.7141)
#
# parameter relative change
# CV +147.37%
# theta0 3.92%
# n 25.00%
Confirms what we have seen above. As expected the method is extremely robust to changes of the CV. The sample size is also not very sensitive; many overrate the impact of dropouts on power.
I recommend to perform pilot studies in one of the fully replicated
designs. If you are concerned about dropouts or the bioanalytical method
requires large sample volumes, opt for one the 2sequence 3period
designs (TRTRTR or TRRRTT).
Contrary to the partial replicate design (TRRRTRRRT) we get estimates
of both \(\small{CV_\textrm{wT}}\) and
\(\small{CV_\textrm{wR}}\). Since
pharmaceutical technology – hopefully – improves, it is not uncommon
that \(\small{CV_\textrm{wT}<CV_\textrm{wR}}\).
If this is the case, we get an incentive in the sample size of the
pivotal study (scaling is based on \(\small{CV_\textrm{wR}}\) but the 90%
CI on the – pooled – \(\small{s_\textrm{w}^{2}}\)).
\[\eqalign{\tag{7} s_\textrm{wT}^{2}&=\log_{e}(CV_\textrm{wT}^{2}+1)\\ s_\textrm{wR}^{2}&=\log_{e}(CV_\textrm{wR}^{2}+1)\\ s_\textrm{w}^{2}&=\left(s_\textrm{wT}^{2}+s_\textrm{wR}^{2}\right)/2\\ CV_\textrm{w}&=\sqrt{\exp(s_\textrm{w}^{2})1}}\]
Say, we performed two pilot studies.
In the partial replicate we estimated the \(\small{CV_\textrm{w}}\) with 0.45 (and have
to assume homoscedasticity, i.e., \(\small{CV_\textrm{wT}=CV_\textrm{wR}}\),
which might be wrong).
In the full replicate we estimated \(\small{CV_\textrm{wT}}\) with 0.414 and
\(\small{CV_\textrm{wR}}\) with 0.484.
Note that the \(\small{CV_\textrm{w}}\)
is 0.45 as well. How will that impact the sample size of the pivotal
4period full replicate design?
< data.frame(pilot = c("TRRRTRRRT", "TRTRTR"),
comp CVwT = c(0.45, 0.414),
CVwR = c(0.45, 0.484),
CVw = NA,
n = NA, power = NA)
for (i in 1:nrow(comp)) {
4] < signif(
comp[i, mse2CV((CV2mse(comp$CVwT[i]) +
CV2mse(comp$CVwR[i])) / 2), 3)
5:6] < sampleN.RSABE(CV = c(comp$CVwT[i], comp$CVwR[i]),
comp[i, design = "2x2x4", details = FALSE,
print = FALSE)[8:9]
}print(comp, row.names = FALSE)
# pilot CVwT CVwR CVw n power
# TRRRTRRRT 0.450 0.450 0.45 24 0.82450
# TRTRTR 0.414 0.484 0.45 20 0.80146
Since bioanalytics drives study costs to a great extent, we may safe ~17%.
Note that when you give CV
as twoelement vector, the
first element has to be \(\small{CV_\textrm{wT}}\) and the second
\(\small{CV_\textrm{wR}}\).
Although according to the guidances it is not required to estimate \(\small{CV_\textrm{wT}}\), its value is ‘nice to know’. Sometimes studies fail only due to the large \(\small{CV_\textrm{wR}}\) thus inflating the confidence interval. In such a case you have at least ammunation to start an argument.
Even if you plan the pivotal study in a partial replicate design (why on earth?)^{25} knowing both \(\small{CV_\textrm{wT}}\) and \(\small{CV_\textrm{wR}}\) is useful.
< data.frame(pilot = c("TRRRTRRRT", "TRTRTR"),
comp CVwT = c(0.45, 0.414),
CVwR = c(0.45, 0.484),
CVw = NA,
n = NA, power = NA)
for (i in 1:nrow(comp)) {
4] < signif(
comp[i, mse2CV((CV2mse(comp$CVwT[i]) +
CV2mse(comp$CVwR[i])) / 2), 3)
5:6] < sampleN.RSABE(CV = c(comp$CVwT[i], comp$CVwR[i]),
comp[i, design = "2x3x3", details = FALSE,
print = FALSE)[8:9]
}print(comp, row.names = FALSE)
# pilot CVwT CVwR CVw n power
# TRRRTRRRT 0.450 0.450 0.45 33 0.82802
# TRTRTR 0.414 0.484 0.45 27 0.81239
Again, a smaller sample size is possible.
For demonstrating bioequivalence for the FDA and China’s CDE the pharmacokinetic metrics C_{max}, AUC_{0–t}, and AUC_{0–∞} are mandatory.
We don’t have to worry about multiplicity issues (inflated Type I Error) since if all tests must pass at level \(\alpha\), we are protected by the intersectionunion principle.^{26} ^{27} We design the study always for the worst case combination, i.e., based on the PK metric requiring the largest sample size. Let’s explore that with different CVs and T/Rratios.
< c("Cmax", "AUC")
metrics < c(0.45, 0.30)
CV < c(0.90, 0.925)
theta0 < "2x2x4"
design < data.frame(metric = metrics, CV = CV, swR = CV2se(CV),
df theta0 = theta0, n = NA, power = NA)
1, 5:6] < sampleN.RSABE(CV = CV[1], theta0 = theta0[1],
df[design = design, details = FALSE,
print = FALSE)[8:9]
2, 5:6] < sampleN.RSABE(CV = CV[2], theta0 = theta0[2],
df[design = design, details = FALSE,
print = FALSE)[8:9]
$swR < signif(df$swR, 5)
df< paste0("Sample size based on ",
txt $metric[df$n == max(df$n)], ".\n")
dfprint(df, row.names = FALSE)
cat(txt)
# metric CV swR theta0 n power
# Cmax 0.45 0.42942 0.900 24 0.82450
# AUC 0.30 0.29356 0.925 22 0.81108
# Sample size based on Cmax.
As usual the ‘worse’ PK metric
drives the sample size. Although we could scale for
C_{max} (s_{wR}
0.42942 ≥ 0.294) but not for AUC s_{wR} 0.29356 < 0.294), this
advantage is counteracted by its ‘worse’ T/Rratio (0.900 instead of
0.925). Power is extremely sensitive to the T/Rratio (see Fig. 4).
Consequently, with 24 subjects the study will be slightly
‘overpowered’ (~0.839) for AUC.
Let us assume the same T/Rratios for both metrics. Which are the extreme T/Rratios (largest deviations of T from R) for C_{max} giving still the target power?
< function(x) {
opt power.RSABE(theta0 = x, CV = df$CV[1],
design = design,
n = df$n[2])  target
}< c("Cmax", "AUC")
metrics < c(0.45, 0.30)
CV < 0.90
theta0 < 0.80
target < "2x2x4"
design < data.frame(metric = metrics, CV = CV, swR = CV2se(CV),
df theta0 = theta0, n = NA, power = NA)
1, 5:6] < sampleN.RSABE(CV = CV[1], theta0 = theta0,
df[design = design, details = FALSE,
print = FALSE)[8:9]
2, 5:6] < sampleN.RSABE(CV = CV[2], theta0 = theta0,
df[design = design, details = FALSE,
print = FALSE)[8:9]
$swR < signif(df$swR, 5)
df$power < signif(df$power, 5)
dfif (theta0[1] < 1) {
< uniroot(opt, tol = 1e8,
res interval = c(0.80 + 1e4, theta0))
else {
} < uniroot(opt, tol = 1e8,
res interval = c(theta0, 1.25  1e4))
}< unlist(res)
res < c(res[["root"]], 1/res[["root"]])
theta0s < paste0("Target power for ", metrics[1],
txt " and sample size ",
$n[2], "\nachieved for theta0 ",
dfsprintf("%.4f", theta0s[1]), " or ",
sprintf("%.4f", theta0s[2]), ".\n")
print(df, row.names = FALSE)
cat(txt)
# metric CV swR theta0 n power
# Cmax 0.45 0.42942 0.9 24 0.82450
# AUC 0.30 0.29356 0.9 32 0.81666
# Target power for Cmax and sample size 32
# achieved for theta0 0.8661 or 1.1546.
That means, although we assumed for C_{max} the same T/Rratio as for AUC, with the sample size of 32 required AUC, for C_{max} it can be as low as ~0.866 or as high as ~1.155, which is an interesting sideeffect.
Q: Can we use R in a
regulated environment and is PowerTOST
validated?
A: See this document^{28} about the
acceptability of Base R.
R
is updated every couple of
months with documented changes^{29} and maintaining a bugtracking system.^{30} I
recommed to use always the latest release.
The authors of PowerTOST
tried to do their best to provide
reliable and valid results. The ‘NEWS’file
on CRAN documents the development of the package, bugfixes, and
introduction of new methods.
Validation of any software (yes, of SAS as well…) lies in the
hands of the user.^{31} ^{32}
Execute the script test_RSABE.R
which can be found in the
/tests
subdirectory of the package to reproduce tables
given in the literature.^{33} You will notice some discrepancies: The
authors employed only 10,000 simulations – which is not sufficient for a
stable result (see below). Furthermore, they
reported the minimum sample size which gives at least the target power,
wheras sampleN.RSABE()
always rounds up to obtain balanced
sequences.
Q: Shall we throw away our sample size
tables?
A: Not at all. File them in your archives to collect
dust. Maybe in the future you will be asked by an agency how you arrived
at a sample size. But: Don’t use them any more. What you should not do
(and hopefully haven’t done before): Interpolate. Power and therefore,
the sample size depends in a highly nonlinear fashion on the five
conditions listed above, which makes
interpolation of values given in table a nontrivial job.
Q: I fail to understand your example about
dropouts. We finish the study with 24 eligible subjects as desired. Why
is the dropoutrate ~20% and not the anticipated 15%?
A: That’s due to rounding up the calculated adjusted
sample size (28.24…) to the next even number (30). If you
manage it to dose fractional subjects (I can’t), your dropout rate would
indeed equal the anticipated one: 100 × (1 – 24/28.24…) = 15%. ⬜
Q: Do we have to worry about unbalanced
sequences?
A: sampleN.RSABE()
will always give the
total number of subjects for balanced sequences.
If you are interested in post hoc power, give the sample size
as a vector, i.e.,
power.RSABE(..., n = c(foo, bar, baz)
, where
foo
, bar
, and baz
are the number
of subjects per sequence.
Q: The
default number of simulations in the sample size estimation is
100,000. Why?
A: We found that with this number the simulations are
stable. For the background see another article. Of
course, you can give a larger number in the argument nsims
.
However, you shouldn’t decrease the number.
Q: How reliable are the results?
A: As stated above an exact
method doesn’t exist. We can only compare the empiric power of the
ABEcomponent to
the exact one obtained by power.TOST()
. For an example see
‘Post hoc Power’ in the section about Dropouts.
Q: I still have questions. How to proceed?
A: The preferred method is to register at the BEBA Forum and post your question in
the category or (please read the Forum’s Policy
first).
You can contact me at [email protected]. Be
warned – I will charge you for anything beyond most basic
questions.
Licenses
Helmut Schütz 2022
R
and PowerTOST
GPL 3.0.
1^{st} version March 27, 2021. Rendered April 04, 2022 13:40
CEST by rmarkdown via pandoc in 1.45 seconds.
Footnotes and References
Labes D, Schütz H, Lang B. PowerTOST: Power and Sample Size for (Bio)Equivalence Studies. Package version 1.5.4.9000. 20220317. CRAN.↩︎
Schütz H. ReferenceScaled Average Bioequivalence. 20220219. CRAN.↩︎
Labes D, Schütz H, Lang B. Package ‘PowerTOST’. February 19, 2022. CRAN.↩︎
Some gastric resistant formulations of diclofenac are HVDPs, practically all topical formulations are HVDPs, whereas diclofenac itself is not a HVD (\(\small{CV_\textrm{w}}\) of a solution ~8%).↩︎
Tóthfalusi L, Endrényi L, GarcíaArieta A. Evaluation of bioequivalence for highly variable drugs with scaled average bioequivalence. Clin Pharmacokinet. 2009; 48(11): 725–43. doi:10.2165/1131804000000000000000.↩︎
FDA, CDER. Draft Guidance for Industry. Bioequivalence Studies with Pharmacokinetic Endpoints for Drugs Submitted Under an ANDA. Rockville. August 2021. Download.↩︎
CDE. Annex 2. Technical guidelines for research on bioequivalence of highly variable drugs. Download. [Chinese].↩︎
For unfathomable reasons the FDA recommends a mixedeffects model for fully replicated designs (SAS PROC MIXED) and a fixedeffects model (SAS PROC GLM) for the partial replicate design.↩︎
Howe WG. Approximate Confidence Limits on the Mean of X+Y Where X and Y are Two Tabled Independent Random Variables. J Am Stat Assoc. 1974; 69(347): 789–94. doi:10.2307/2286019.↩︎
Davit BM, Chen ML, Conner DP, Haidar SH, Kim S, Lee CH, Lionberger RA, Makhlouf FT, Nwakama PE, Patel DT, Schuirmann DJ, Yu LX. Implementation of a ReferenceScaled Average Bioequivalence Approach for Highly Variable Generic Drug Products by the US Food and Drug Administration. AAPS J. 2012; 14(4): 915–24. doi:10.1208/s122480129406x. Free Full Text.↩︎
At a \(\small{CV_{\textrm{wR}}}\) which is infinitesimally lower than \(\small{CV_{\textrm{wR}}\;0.30047}\) the ‘implied limits’ are still 80.00 – 125.00% but jump to 76.92 – 130.01% at \(\small{CV_{\textrm{wR}}\approx0.30047}\).↩︎
That’s contrary to ABE, where \(\small{CV_\textrm{w}}\) is an assumption as well.↩︎
Senn S. Guernsey McPearson’s Drug Development Dictionary. 21 April 2020. Online.↩︎
Hoenig JM, Heisey DM. The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. Am Stat. 2001; 55(1): 19–24. doi:10.1198/000313001300339897. Open Access.↩︎
There is no statistical method to ‘correct’ for unequal carryover. It can only be avoided by design, i.e., a sufficiently long washout between periods. According to the guidelines subjects with predose concentrations > 5% of their C_{max} can by excluded from the comparison if stated in the protocol.↩︎
This is not always the case in RSABE. This issue is elaborated in another article.↩︎
Senn S. Statistical Issues in Drug Development. Chichester: John Wiley; 2^{nd} ed 2007.↩︎
Zhang P. A Simple Formula for Sample Size Calculation in Equivalence Studies. J Biopharm Stat. 2003; 13(3): 529–38. doi:10.1081/BIP120022772.↩︎
Don’t be tempted to give a ‘better’ T/Rratio – even if based on a pilot or a previous study. It is a natural property of HVD(P)s that the T/Rratio varies between studies. Don’t be overly optimistic!↩︎
Doyle AC. The Adventures of Sherlock Holmes. A Scandal in Bohemia. 1892. p. 3.↩︎
Schütz H. Sample Size Estimation in Bioequivalence. Evaluation. 20201023. BEBA Forum.↩︎
Lenth RV. Two SampleSize Practices that I Don’t Recommend. October 24, 2000. Online.↩︎
The only exception is design = "2x2x3"
(the full replicate with sequences TRTRTR). Then the first element is
for sequence TRT and the second for RTR.↩︎
Quoting my late father: »If you believe, go to church.«↩︎
For obstacles see this article.↩︎
Berger RL, Hsu JC. Bioequivalence Trials, IntersectionUnion Tests and Equivalence Confidence Sets. Stat Sci. 1996; 11(4): 283–302. JSTOR:2246021.↩︎
Zeng A. The TOST confidence intervals and the coverage probabilities with R simulation. March 14, 2014. Online.↩︎
The R Foundation for Statistical Computing. A Guidance Document for the Use of R in Regulated Clinical Trial Environments. October 18, 2021. Online.↩︎
FDA. Statistical Software Clarifying Statement. May 6, 2015. Download.↩︎
WHO. Guidance for organizations performing in vivo bioequivalence studies. Geneva. May 2016. Technical Report Series No. 996, Annex 9. Section 4. Online.↩︎
Tóthfalusi L, Endrényi L. Sample Sizes for Designing Bioequivalence Studies for Highly Variable Drugs. J Pharm Pharmaceut Sci. 2012; 15(1): 73–84. doi:10.18433/J3Z88F. Open Access.↩︎