Consider allowing JavaScript. Otherwise, you have to be proficient in reading since formulas will not be rendered. Furthermore, the table of contents in the left column for navigation will not be available and code-folding not supported. Sorry for the inconvenience.

Examples in this article were generated with  4.2.1 by the packages PowerTOST1 and shape.2

• The right-hand badges give the ‘level’ of the respective section.

1. Basics about sample size methodology and study designs – requiring no or only limited statistical expertise.

1. These sections are the most important ones. They are – hopefully – easily comprehensible even for novices.

1. A somewhat higher knowledge of statistics and/or R is required. May be skipped or reserved for a later reading.

1. An advanced knowledge of statistics and/or R is required. Not recommended for beginners in particular.
• Click to show / hide R code.
Abbreviation Meaning
$$\small{\alpha}$$ Level of the test, probability of the Type I Error (consumer risk)
(A)BE (Average) Bioequivalence
CI Confidence Interval
$$\small{CV}$$ Within-subject Coefficient of Variation (paired, crossover, replicate designs),
total CV (parallel designs)
$$\small{\Delta}$$ Clinically relevant difference
$$\small{G\times T}$$ Group-by-Treatment interaction
$$\small{H_0}$$ Null hypothesis
$$\small{H_1}$$ Alternative hypothesis (also $$\small{H_\textrm{a}}$$)
$$\small{\theta_0=\mu_\textrm{T}/\mu_\textrm{R}}$$ True T/R-ratio
$$\small{\theta_1,\;\theta_2}$$ Lower and upper limits of the acceptance range (BE margins)
2×2×2 Two treatment, two sequence, two period crossover design

# Introduction

How to deal with multiple groups or clinical sites?

The most simple – and preferrable – approach is to find a clinical site which is able to accommodate all subjects at once. If this is not possible, subjects could be allocated to multiple groups or sites. Whether or not a group- (site-) term should by included in the statistical model is still the topic of heated discussions lively debates. In the case of multi-site studies regulators likely require a modification of the model.3 4 5

Since in replicate designs less subjects are required to achieve the same power than in a conventional 2×2×2 crossover design, sometimes multi-group studies can be avoided (see another article).

For the FDA, in some MENA-states (especially Saudi Arabia), and in the EEA the ‘group effect’ is an issue.
Hundreds (or thousands‽) of studies have been performed in multiple groups, evaluated by the common statistical model ($$\small{\text{III}}$$ below), and were accepted by European agencies in the twinkling of an eye. Surprisingly, Euro­pean assessors started recently to ask for assessment of the ‘Group effect’ as well.

## Preliminaries

A basic knowledge of R is required. To run the scripts at least version 1.4.8 (2019-08-29) of PowerTOST is suggested. Any version of R would likely do, though the current release of PowerTOST was only tested with version 4.1.3 (2022-03-10) and later.
All scripts were run on a Xeon E3-1245v3 @ 3.40GHz (1/4 cores) 16GB RAM with R 4.2.1 on Win­dows 7 build 7601, Service Pack 1, Universal C Runtime 10.0.10240.16390.

We consider mainly multiple groups, i.e., studies performed in a single site. However, the concept is applicable for studies performed in multiple sites as well. Studies in groups in multiple sites are out of scope of this article.
The examples deal primarily with the 2×2×2 crossover design ($$\small{\textrm{TR}|\textrm{RT}}$$)6 but the concept is applicable to any kind of cross­over (Higher-Order, replicate designs) or parallel designs assessed for equivalence.

## Background

Sometimes studies are split into two or more groups of subjects or are performed in multiple clinical sites.

• Reasons:
• For ‘logistic’ ones, i.e., a limited capacity of the clinical site: Some jurisdictions (the EEA, the United Kingdom, the EEU, ASEAN States, Australia, New Zealand, Brazil, Egypt, and members of the GCC) allow reference-scaling only for Cmax – leading to extreme sample sizes if products are highly variable in AUC as well.
• Some PIs (especially in university hospitals) don’t trust in the test product and prefer to start the study in a small group of subjects.
• In large studies in patients recruitment might be an issue. Hence, such studies are regularly performed in multiple sites.
• The common model for crossover studies might not be correct any more.
• Periods refer to different dates.
• Questions may arise whether groups can be naïvely pooled. Pooling of data is valid it the strict sense only if
• all groups have the same size and
• Point Estimates (PEs) of groups / sites would be identical.

If PEs are not identical, this would be indicative of a true ‘Group-by-Treatment interaction’ (aka ‘group effect’), i.e., the outcome is not independent from the group or site.

## Practicalities

There are two approaches, the ‘stacked’ and the ‘staggered’. Say, we have a drug with a moderate half life and a washout of six days is considered sufficient. In both approaches we keep one day between groups. Otherwise, the last sampling of one group would overlap with the pre-dose sampling of the next. A logistic nightmare, overruning the capacity of the clinical site even if the last sampling would be ambulatory.

In the ‘stacked’ approach one would complete the first group before the next starts. Sounds ‘natural’ although it’s a waste of time. Furthermore, if a confounding variable (say, the lunar phase, the weather, you name it) lures in the back, one may run into trouble with the Group-by-Treatment $$\small{(G\times T)}$$ test in model $$\small{(\text{I})}$$. It will falsely detect a difference between groups despite the fact that the true difference is caused by the confounder.
Regardless of this problem it is the method of choice in multiple dose studies, especially if subjects are hospitalized.

In the ‘staggered’ approach we squash the first period of the second group in the washout of the first. Not only faster (which is a nice side effect) but if we apply the $$\small{G\times T}$$  test in model $$\small{(\text{I})}$$, it’s less likely that something weird will happen than in the ‘stacked’ approach. Furthermore, one gets ammunition in an argument with regulators because the interval between groups is substantially smaller than in the ‘stacked’ approach.

# Statistics

## Hypotheses

The lower and upper limits of the bioequivalence (BE) range $$\small{\{\theta_1,\theta_2\}}$$ are defined based on the ‘clinically relevant difference’ $$\small{\Delta}$$ assuming log-normal distributed data $\left\{\theta_1=100\,(1-\Delta),\,\theta_2=100\,(1-\Delta)^{-1}\right\}\tag{1}$ Commonly $$\small{\Delta}$$ is set to 0.20. Hence, we obtain: $\left\{\theta_1=80.00\%,\,\theta_2=125.00\%\right\}\tag{2}$

Conventionally BE is assessed by the confidence interval (CI) inclusion approach: $H_0:\frac{\mu_\textrm{T}}{\mu_\textrm{R}}\not\subset\left\{\theta_1,\theta_2\right\}\:vs\:H_1:\theta_1<\frac{\mu_\textrm{T}}{\mu_\textrm{R}}<\theta_2\tag{3}$

As long as the $$\small{100\,(1-2\,\alpha)}$$ CI lies entirely within the BE margins $$\small{\{\theta_1,\theta_2\}}$$, the Null Hypothesis $$\small{H_0}$$ of inequivalence in $$\small{(3)}$$ is rejected and the Alternative Hypothesis $$\small{H_1}$$ of equivalence in $$\small{(3)}$$ is accepted.

## Models

The totality of data is analyzed with a new term in the analysis of variance (ANOVA), a Treatment × Group interaction term. This is a measure (on a log scale) of how the ratios of test to reference differ in the groups. For example, if the ratios are very much the same in each group, the interaction would be small or negligible. If interaction is large, as tested in the ANOVA, then the groups cannot be combined. However, if at least one of the groups individually passes the confidence interval criteria, then the test product would be acceptable. If interaction is not statistically significant (p > 0.10), then the confidence interval based on the pooled analysis will determine acceptability.
Bolton and Bon (2009)7

Slightly different by the FDA. Details are outlined further down.

### Model III (conventional)

The model for a crossover design in average bioequivalence as stated in guidelines is \eqalign{Y&|\;\text{Sequence},\,\text{Subject}(\text{Sequence}),\\ &\phantom{|}\;\text{Period},\,\text{Treatment}\small{\textsf{,}}}{\tag{III}} where $$\small{Y}$$ is the response, i.e. a certain PK metric. Most regulations recommend an ANOVA, i.e., all effects are fixed. The FDA and Health Canada recommend a mixed-effects model, i.e., to specify $$\small{\text{Subject}(\text{Sequence})}$$ as a random effect and all others fixed.
It must be mentioned that in comparative bioavailability studies subjects are usually uniquely coded. Hence, the nested term $$\small{\text{Subject}(\text{Sequence})}$$ in $$\small{(\text{III})}$$ is a bogus one8 and could be replaced by the simple $$\small{\text{Subject}}$$ as well. See also another article.

Groups and sites are mentioned by the FDA.3

If a crossover study is carried out in two or more groups of subjects (e.g., if for logistical reasons only a limited number of subjects can be studied at one time), the statistical model should be modified to reflect the multigroup nature of the study. In particular, the model should reflect the fact that the periods for the first group are different from the periods for the second group.
If the study is carried out in two or more groups and those groups are studied at different clinical sites, or at the same site but greatly separated in time (months apart, for example), questions may arise as to whether the results from the several groups should be combined in a single analysis.

FDA (2001)3
(my emphases)

In the EEU a modification of the model is mandatory (Article 93), unless a justification is stated in the protocol and discussed with the competent authority (Article 94).5 That’s my interpretation. If you know Russian:

Исследования в нескольких группах
1. Если перекрестное исследование проведено в 2 и более группах субъектов, т.е. разбиение всей выборки на несколько групп, каждая из которых начинает участие в исследовании в разные дни (например, если из логистических со­ображений единовременно в клиническом центре можно провести исследование с участием ограниченного числа субъектов), в целях отражения многогруппового характера исследования необходимо модифицировать статистическую модель. В частности, в модели необходимо учесть тот факт, что периоды для первой гру­ппы отличаются от периодов для второй (и последующих) группы.
2. Если исследование проведено в двух и более группах и эти группы изучались в различных клинических центрах или в одном и том же центре, но были разде­лены большим промежутком времени (например, месяцами), возникает сом­нение относительно возможности объединения результатов, полученных этих группах, в один анализ. Такие ситуации необходимо обсуждать с уполномо­ченным органом.
Если предполагается проведение исследования в нескольких группах из логистических соображений, об этом необходимо явно указать в протоколе исследования; при этом, если в отчете отсутствуют результаты статистического анализа, учитывающие многогрупповой характер исследования, необходимо представить научное обоснование отсутствия таких результатов

Council of the EEC (2020)5

In the EMA’s guideline4 we find only:

The study should be designed in such a way that the formulation effect can be distinguished from other effects.
The precise model to be used for the analysis should be pre-specified in the protocol. The statistical analysis should take into account sources of variation that can be reasonably assumed to have an effect on the response variable.

EMA (2010)4
(my emphasis)

As similar wording is given in all other global guidelines.

### Model II (groups, sites)

If the study is performed in multiple groups or sites, model $$\small{(\text{III})}$$ can be modified to \eqalign{Y&|\;\text{F},\,\;\text{Sequence},\,\text{Subject}(\text{F}\times \text{Sequence}),\\ &\phantom{|}\;\text{Period}(\text{F}),\;\text{F}\times \text{Sequence},\,\text{Treatment}\small{\textsf{,}}}\tag{II} where $$\small{\text{F}}$$ is the code for $$\small{\text{Group}}$$ or $$\small{\text{Site}}$$, respectively.

When $$\small{N_\text{F}}$$ is the number of groups or sites, in $$\small{(\text{II})}$$ there are $$\small{N_\text{F}-1}$$ less residual degrees of freedom than in $$\small{(\text{III})}$$. The table below gives the designs implemented in PowerTOST.

design code t s p m $$\small{(\text{III})}$$ $$\small{(\text{II})}$$
Parallel "parallel" $$\small{2}$$ $$\small{-}$$ $$\small{1}$$ $$\small{1}$$ $$\small{\phantom{0}N-2}$$ $$\small{\phantom{0}N-2-(N_\text{F}-1)}$$
Paired means "paired" $$\small{2}$$ $$\small{-}$$ $$\small{2}$$ $$\small{2}$$ $$\small{\phantom{0}N-1}$$ $$\small{\phantom{0}N-1-(N_\text{F}-1)}$$
Crossover "2x2x2" $$\small{2}$$ $$\small{2}$$ $$\small{2}$$ $$\tfrac{1}{2}$$ $$\small{\phantom{0}N-2}$$ $$\small{\phantom{0}N-2-(N_\text{F}-1)}$$
2-sequence 3-period full replicate "2x2x3" $$\small{2}$$ $$\small{2}$$ $$\small{3}$$ $$\tfrac{3}{8}$$ $$\small{2N-3}$$ $$\small{2N-3-(N_\text{F}-1)}$$
2-sequence 4-period full replicate "2x2x4" $$\small{2}$$ $$\small{2}$$ $$\small{4}$$ $$\tfrac{1}{4}$$ $$\small{3N-4}$$ $$\small{3N-4-(N_\text{F}-1)}$$
4-sequence 4-period full replicate "2x4x4" $$\small{2}$$ $$\small{4}$$ $$\small{4}$$ $$\tfrac{1}{16}$$ $$\small{3N-4}$$ $$\small{3N-4-(N_\text{F}-1)}$$
Partial replicate "2x3x3" $$\small{2}$$ $$\small{3}$$ $$\small{3}$$ $$\tfrac{1}{6}$$ $$\small{2N-3}$$ $$\small{2N-3-(N_\text{F}-1)}$$
Balaam’s "2x4x2" $$\small{2}$$ $$\small{4}$$ $$\small{2}$$ $$\tfrac{1}{2}$$ $$\small{\phantom{0}N-2}$$ $$\small{\phantom{0}N-2-(N_\text{F}-1)}$$
Latin Squares "3x3" $$\small{3}$$ $$\small{3}$$ $$\small{3}$$ $$\tfrac{2}{9}$$ $$\small{2N-4}$$ $$\small{2N-4-(N_\text{F}-1)}$$
Williams’ "3x6x3" $$\small{3}$$ $$\small{6}$$ $$\small{3}$$ $$\tfrac{1}{18}$$ $$\small{2N-4}$$ $$\small{2N-4-(N_\text{F}-1)}$$
Latin Squares or Williams’ "4x4" $$\small{4}$$ $$\small{4}$$ $$\small{4}$$ $$\tfrac{1}{8}$$ $$\small{3N-6}$$ $$\small{3N-6-(N_\text{F}-1)}$$

code is the design-argument in the functions of PowerTOST. t, s, p are the number of treatments, sequences, and periods, respectively. m is the multiplier in the radix of $$\small{(4)}$$ and $$\small{N}$$ is the total number of subjects, i.e., $$\small{N=\sum_{i=1}^{i=s}n_i}$$.

The back-transformed $$\small{1-2\,\alpha}$$ Confidence Interval (CI) is calculated by $CI=100\,\exp\left(\log_{e}\overline{x}_\text{T}-\log_{e}\overline{x}_\text{R}\mp t_{df,\alpha}\sqrt{m \times\widehat{s^2}\sum_{i=1}^{i=s}\frac{1}{n_i}}\,\right)\small{\textsf{,}}\tag{4}$ where $$\widehat{s^2}$$ is the residual variance of the model and $$\small{n_i}$$ is the number of subjects in the $$\small{i^\text{th}}$$ of $$\small{s}$$ sequences.

Therefore, the CI by $$\small{(4)}$$ for given $$\small{N}$$ and $$\small{\widehat{s^2}}$$ by $$\small{(\text{II})}$$ will always be wider (more conservative) than the one by $$\small{(\text{III})}$$ due to the $$\small{N_\text{F}-1}$$ fewer degrees of freedom and hence, larger $$\small{t}$$-value. Note that $$\small{\widehat{s^2}}$$ generally is slightly different in both models.
However, unless the sample size is not small and the number of groups or sites large, the difference in widths of the CIs (and hence, the power of the study) is generally negligible.

### Model I (pre-test)

To turn the argument of the FDA’s guidance3 (see the quote above) on its head, it means that if studies are performed at a single site and groups are not greatly separated in time, the model has not to be modified, i.e., the conventional model $$\small{(\text{III})}$$ of pooled data can be used.

If this is not the case (e.g., months between groups, different sites), regrettably no details about how such a mo­di­fication should be done is given in any guidance. However, this text can be found under the FOI9 and in deficiency letters:

The following statistical model can be applied: \eqalign{Y&|\;\text{Group},\,\text{Sequence},\,\text{Treatment},\\ &\phantom{|}\;\text{Subject}(\text{Group}\times \text{Sequence}),\,\text{Period}(\text{Group}),\\ &\phantom{|}\;\text{Group}\times \text{Sequence},\,\text{Group}\times \text{Treatment}\small{\textsf{,}}}\tag{I} where $$\small{\text{Subject}(\text{Group}\times \text{Sequence})}$$ is a random effect and all other effects are fixed effects.

• If the Group-by-Treatment interaction test is not statistically significant (p ≥0.1), only the Group-by-Treatment term can be dropped from the model. → (II)
• If the Group-by-Treatment interaction is statistically significant (p <0.1), DBE requests that equivalence be demonstrated in one of the groups, provided that the group meets minimum requirements for a complete BE study.
• […] the statistical analysis for BE studies dosed in more than one group should commence only after all subjects have been dosed and all pharmacokinetic parameters have been calculated. Statistical analysis to determine BE within each dosing group should never be initiated prior to dosing the next group […].
• If ALL of the following criteria are met, it may not be necessary to include Group-by-Treatment in the statistical model:
• the clinical study takes place at one site;
• all study subjects have been recruited from the same enrollment pool;
• all of the subjects have similar demographics;
• all enrolled subjects are randomly assigned to treatment groups at study outset.
• In this latter case, the appropriate statistical model would include only the factors
Sequence, Period, Treatment and Subject (nested within Sequence).

Regrettably the WHO stated also:10

In those cases where the subjects are recruited and treated in groups, it is appropriate to investigate the statistical significance of the group-by-formulation interaction e.g., with the following ANOVA model: Group, Sequence, Formulation, Period (nested within Group), Group-by-Sequence interaction, Subject (nested within Group*Sequence) and Group-by-Formulation interaction. If this interaction is significant, the study results are not interpretable. However, it is not considered to be correct to report the results of the 90% confidence interval of the ratio test/comparator based on the standard error derived from this ANOVA. If the group-by-formulation interaction is not significant, the 90% confidence interval should be calculated based on the ANOVA model defined in the protocol. This model may or may not include the group effect as pre-defined in the protocol. This depends on whether the group effect is believed to explain the variability observed in the data.
WHO (2020)15
(my emphases)

On the contrary, my dear Dr. Watson! It is not appropriate.
At least we can state in the protocol that we don’t believe [sic]11 in a group-by-treatment interaction. Believes don’t belong to the realm of science – only assumptions do.

## Degrees of freedom

library(PowerTOST) # attach it to run the examples

In a 2×2×2 crossover design the residual degrees of freedom for $$\small{(4)}$$ in the models are \eqalign{(\text{III}):df&=N-2\\ (\text{II}):df&=N-2-(N_\text{F}-1)\small{\textsf{,}}}\tag{5} where $$\small{N}$$ is total number of subjects and $$\small{N_\text{F}}$$ the number of groups or sites.

Let us explore an example: T/R-ratio (for simplicity equal in all groups) 0.95, CV 0.335, 48 subjects, two to eight groups (or sites).

design <- "2x2x2"
theta0 <- 0.95
CV     <- 0.335
var    <- CV2mse(CV)
n      <- sampleN.TOST(CV = CV, theta0 = theta0, design = design,
print = FALSE)[["Sample size"]]
n1     <- n2 <- n / 2
NF     <- c(2:4, 6, 8)
x      <- data.frame(n = n, NF = NF,
df.3 = n - 2,
CL.lo.3 = NA_real_, CL.hi.3 = NA_real_,
width.3 = NA_real_,
df.2 = n - 2 - (NF - 1),
CL.lo.2 = NA_real_, CL.hi.2 = NA_real_,
width.2 = NA_real_)
for (i in seq_along(NF)) {
x[i, 4:5] <- exp(log(theta0) - log(1) + c(-1, +1) *
qt(1 - 0.05, df = x$df.3[i]) * sqrt(var / 2 * (1 / n1 + 1 / n2))) x[i, 8:9] <- exp(log(theta0) - log(1) + c(-1, +1) * qt(1 - 0.05, df = x$df.2[i]) *
sqrt(var / 2 * (1 / n1 + 1 / n2)))
}
x$width.3 <- x$CL.hi.3 - x$CL.lo.3 x$width.2 <- x$CL.hi.2 - x$CL.lo.2
diffs     <- x$width.2 - x$width.3
names(x)[c(1, 3:5, 7:9)] <- c("N", "df (III)", "CL.lo (III)", "CL.hi (III)",
"df (II)", "CL.lo (II)", "CL.hi (II)")
x[, 4] <- sprintf("%.3f%%", 100 * x[, 4])
x[, 5] <- sprintf("%.3f%%", 100 * x[, 5])
x[, 8] <- sprintf("%.3f%%", 100 * x[, 8])
x[, 9] <- sprintf("%.3f%%", 100 * x[, 9])
print(x[, c(1:5, 7:9)], row.names = FALSE)
cat("Maximum difference in the width of CIs:",
sprintf("%.2g%%.\n", 100 * max(diffs)))
#   N NF df (III) CL.lo (III) CL.hi (III) df (II) CL.lo (II) CL.hi (II)
#  48  2       46     84.955%    106.232%      45    84.951%   106.238%
#  48  3       46     84.955%    106.232%      44    84.946%   106.243%
#  48  4       46     84.955%    106.232%      43    84.942%   106.249%
#  48  6       46     84.955%    106.232%      41    84.932%   106.262%
#  48  8       46     84.955%    106.232%      39    84.920%   106.276%
# Maximum difference in the width of CIs: 0.079%.

As stated above, the difference in the widths of CIs is negligible in any case. If we have just two groups or sites the difference in the widths of CIs is <0.01%.

## Power

The function power.TOST.sds() of PowerTOST supports simulations of models $$\small{\text{(I)}}$$, $$\small{\text{(II)}}$$ and $$\small{\text{(III)}}$$ of studies in a 2×2×2 crossover, as well as in full and partial replicate designs.

The script supports sample size estimation for average bioequivalence and generation of groups based on the clinical capacity, where sequences within groups will be balanced.
The estimated sample size can be adjusted based on an anticipated dropout-rate. If this is selected (do.rate > 0) two simulations are performed:
Realization: Ob­ser­vations (in a sample) of a random variable (of the population).

One for the adjusted sample size (optimistic: no dropouts) and one for the estimated sample size (pessimistic: dropout-rate realized).
Two options for the generation of groups are supported by the logical argument equal:

1. FALSE: Attempts to generate at least one group with the maximum size of the clinical site (default).
2. TRUE:    Attempts to generate equally sized groups.

Last but not least power by $$\small{\text{(II)}}$$ or – if gmodel = 1 the FDA’s decision scheme – is compared to exact power by $$\small{\text{(III)}}$$.
As we have seen in the table above, due to its lower degrees of freedom power of $$\small{\text{(II)}}$$ should always be lower than the one of $$\small{\text{(III)}}$$. If you get a positive change value in the comparison, it is a simulation artifact. In such a case, increase the number of simulations (nsims = 1e6 or higher). Cave: 182 LOC.

sim.groups <- function(CV, theta0 = 0.95, theta1, theta2, target = 0.80,
design = "2x2x2", capacity, equal = FALSE,
gmodel = 2, do.rate = 0, nsims = 1e5L, show = TRUE) {
##########################################################
# Explore the impact on power of a group model com-      #
# pared to the conventional model of pooled data via     #
# simulations.                                           #
# ------------------------------------------------------ #
# capacity: maximum capacity of the clinical site        #
# equal:    TRUE : tries to generate equally sized       #
#                  groups                                #
#           FALSE: tries to get at least one group with  #
#                  the maximum size of the clinical site #
#                  (default)                             #
# do.rate:  anticipated dropout-rate; if > 0 a second    #
#           simulation is performed based on the adjust- #
#           ed sample size                               #
##########################################################
if (missing(theta1) & missing(theta2)) theta1 <- 0.80
if (missing(theta1)) theta1 <- 1 / theta2
if (missing(theta2)) theta2 <- 1 / theta1
if (theta0 < theta1 | theta0 > theta2)
stop("theta0 must be within {theta1, theta2}.", call. = FALSE)
if (theta0 == theta1 | theta0 == theta2)
stop("Simulation of the Type I Error not supported.",
"\n       Use power.TOST.sds() directly.", call. = FALSE)
if (!design %in% c("2x2", "2x2x2", "2x3x3", "2x2x4", "2x2x3"))
stop("Design \"", design, "\" not implemented.", call. = FALSE)
ns <- as.integer(substr(design, 3, 3))
make.equal <- function(n, ns) {
# make equally sized sequences
return(as.integer(ns * (n %/% ns + as.logical(n %% ns))))
}
nadj <- function(n, do.rate, ns) {
return(as.integer(make.equal(n / (1 - do.rate), ns)))
}
grouping <- function(capacity, n, design, equal, do.rate) {
# based on the sample size and capacity, calculate the
# number of groups and subjects / group
if (do.rate == 0) {
x <- n
stop.txt <- paste("The sample size of", x)
}else {
stop.txt <- paste("The adjusted sample size of", x)
}
if (x <= capacity) {
stop(stop.txt, " does not exhaust the clinical capacity.",
call. = FALSE)
}else {
# split sample size into >=2 groups based on capacity
# TODO: Handle a case where with no dropouts <= capacity (no grouping)
#       but with the adjusted sample size two groups are required
if (equal) {# attempt to make all groups equally sized
grps <- ceiling(n / capacity)
tmp  <- rep(n / grps, grps)
ngrp <- make.equal(tmp, ns)
if (!isTRUE(all.equal(tmp, ngrp)))
message("Note: Imbalanced sequences in groups (",
paste(round(tmp, 0), collapse = "|"), ") corrected.\n")
if (sum(ngrp) > n) ngrp[length(ngrp)] <- n-sum(ngrp[-1])
}else {    # at least one group = capacity
ngrp    <- rep(0, ceiling(n / capacity))
grps    <- length(ngrp)
ngrp[1] <- capacity
for (j in 2:grps) {
n.tot <- sum(ngrp) # what we have so far
if (n.tot + capacity <= n) {
ngrp[j] <- capacity
}else {
ngrp[j] <- n - n.tot
}
}
}
}
return(ngrp = list(grps = length(ngrp), ngrp = ngrp))
}
if (equal) {
txt <- "Attempting to generate equally sized groups."
}else {
txt <- paste("Attempting to have at least one group with",
"\nthe maximum capacity of the clinical site.")
}
txt    <- paste(txt, "\nCV                   :", sprintf("%.4f", CV),
"\ntheta0               :", sprintf("%.4f", theta0),
"\nBE-limits            :",
sprintf("%.4f – %.4f", theta1, theta2),
"\nTarget power         :", sprintf("%.2f", target),
"\nDesign               :", design,
"\nClinical capacity    :", capacity)
tmp    <- sampleN.TOST(CV = CV, theta0 = theta0, theta1 = theta1,
theta2 = theta2, targetpower = target,
design = design, print = FALSE)
res    <- data.frame(n = NA_integer_, grps = NA_integer_,
n.grp = NA_integer_, m.1 = NA_real_, m.2 = NA_real_,
change = NA_real_)
res$n <- tmp[["Sample size"]] res[4] <- tmp[["Achieved power"]] x <- grouping(capacity, res$n, design, equal, do.rate)
res[2] <- x[["grps"]]
ngrp   <- x[["ngrp"]]
res[3] <- paste(ngrp, collapse = " | ")
res[5] <- power.TOST.sds(CV = CV, theta0 = theta0, theta1 = theta1,
theta2 = theta2, n = res$n, design = design, grps = res$grps, ngrp = ngrp,
gmodel = gmodel, progress = FALSE, nsims = nsims)
res[6] <- 100 * (res$m.2 - res$m.1) / res$m.1 des <- known.designs()[2:3] df <- as.expression(des$df[des$design == design]) dfs <- rep(NA_integer_, 2) n <- res$n
dfs[1] <- as.integer(eval(parse(text = df)))
dfs[2] <- as.integer(dfs[1] - (res$grps - 1)) txt <- paste(txt, paste0("\n", paste(rep("—", 46), collapse = "")), "\nTotal sample size :", res$n,
"\nNumber of groups     :", sprintf("%2.0f", res[2]),
"\nSubjects per group   :", res[3],
"\nDegrees of freedom   :",
sprintf("%3i", dfs[1]), "(model III)",
"\n                      ",
sprintf("%3i", dfs[2]), "(model II)")
if (do.rate > 0) {
res[2, 1] <- nadj(n, do.rate, ns)
res[2, 4] <- signif(power.TOST(CV = CV, theta0 = theta0, theta1 = theta1,
theta2 = theta2, n = res$n[2], design = design), 4) x <- grouping(capacity, res$n[2], design, equal, do.rate)
res[2, 2] <- x[["grps"]]
ngrp      <- x[["ngrp"]]
res[2, 3] <- paste(ngrp, collapse = " | ")
res[2, 5] <- power.TOST.sds(CV = CV, theta0 = theta0, theta1 = theta1,
theta2 = theta2, n = res$n[2], design = design, grps = res$grps, ngrp = ngrp,
gmodel = gmodel, progress = FALSE, nsims = nsims)
res[2, 6] <- 100 * (res$m.2[2] - res$m.1[2]) / res$m.1[2] dfs <- rep(NA_integer_, 2) n <- res$n[2]
dfs[1]    <- as.integer(eval(parse(text = df)))
dfs[2]    <- as.integer(dfs[1] - (res$grps[2] - 1)) txt <- paste(txt, paste0("\n", paste(rep("—", 46), collapse = "")), "\nAnticip. dropout-rate:", sprintf("%2g%%", 100 * do.rate), "\nAdjusted sample size :", res$n[2],
"\nNumber of groups     :", sprintf("%2.0f", res[2, 2]),
"\nSubjects per group   :", res[2, 3],
"\nDegrees of freedom   :",
sprintf("%3i", dfs[1]), "(model III)",
"\n                      ",
sprintf("%3i", dfs[2]), "(model II)")
}
res[, 4] <- sprintf("%6.4f", res[, 4])
res[, 5] <- sprintf("%6.4f", res[, 5])
res[, 6] <- sprintf("%+.3f%%", res[, 6])
names(res)[4:6] <- c("model III", "model II", "change")
names(res)[c(4, 6)] <- c("model III", "change")
if (gmodel == 1) {
names(res)[5] <- "decision scheme"
}else {
names(res)[5] <- "model II"
}
txt    <- paste0(txt, paste0("\n", paste(rep("—", 46), collapse = "")),
"\nAchieved power of model III and ")
if (gmodel == 1) {
txt <- paste0(txt, "the decision scheme;",
"\nrelative change in power of the decision scheme",
"\ncompared to model III:\n\n")

}else {
txt <- paste0(txt, "II;",
"\nrelative change in power of model II",
"\ncompared to model III:\n\n")
}
if (show) cat(txt)
if (do.rate == 0) {
print(res[, c(1, 4:6)], row.names = FALSE)
}else {
print(res[, c(1, 4:6)])
}
return(result = list(power = res, df = dfs))
}

Say, the assumed CV is 31%, the T/R-ratio 0.95, and we plan the study for ≥ 80% power in a 2×2×2 crossover design. The capacity of the clinical site is 24. We anticipate a dropout-rate of 5% and adjust the sample size accordingly (i.e., dose more subjects in order to have at least as many eligible subjects than estimated for the target power). We want to have at least one group with the capacity of the site.

CV       <- 0.31
theta0   <- 0.95
target   <- 0.80
design   <- "2x2x2"
capacity <- 24
do.rate  <- 0.05
x <- sim.groups(CV = CV, theta0 = theta0, target = target,
design = design, capacity = capacity, do.rate = do.rate)
# Attempting to have at least one group with
# the maximum capacity of the clinical site.
# CV                   : 0.3100
# theta0               : 0.9500
# BE-limits            : 0.8000 – 1.2500
# Target power         : 0.80
# Design               : 2x2x2
# Clinical capacity    : 24
# ——————————————————————————————————————————————
# Total sample size    : 42
# Number of groups     :  2
# Subjects per group   : 24 | 18
# Degrees of freedom   :  40 (model III)
#                         39 (model II)
# ——————————————————————————————————————————————
# Anticip. dropout-rate:  5%
# Adjusted sample size : 46
# Number of groups     :  2
# Subjects per group   : 24 | 22
# Degrees of freedom   :  44 (model III)
#                         43 (model II)
# ——————————————————————————————————————————————
# Achieved power of model III and II;
# relative change in power of model II
# compared to model III:
#
#           n model III model II  change
# Expected 42    0.8113   0.8107 -0.072%
# Adjusted 46    0.8451   0.8428 -0.276%

In $$\small{\text{(II)}}$$ we loose only one degree of freedom compared to $$\small{\text{(III)}}$$. Hence, the loss in power is negligible.

See a rather strange example, where an agency required $$\small{\text{(III)}}$$ – separate for groups – because one of six $$\small{G\times T}$$  tests (three PK metrics of two APIs) was significant.

CV     <- 0.32
theta0 <- 0.95
target <- 0.90
design <- "2x2x2"
x      <- sampleN.TOST(CV = CV, theta0 = theta0, targetpower = target,
design = design, print = FALSE)
n      <- x[["Sample size"]]
power  <- x[["Achieved power"]]
x      <- data.frame(groups = 1:3, n.group = n / 1:3,
power.group = c(power, rep(NA_real_, 2)))
for (i in 2:3) {
x$power.group[i] <- suppressMessages( power.TOST(CV = CV, theta0 = theta0, n = x$n.group[i],
design = design))
}
names(x)[2:3] <- c("n/group", "power/group")
print(x, row.names = FALSE)
#  groups n/group power/group
#       1      60   0.9080189
#       2      30   0.6212292
#       3      20   0.3626223

There were only two groups and the company was lucky because T/R-ratios and CVs were ‘better’ than assumed, as well as the dropout-rate lower than anticipated in the sample size estimation. Impossible that all would have passed if there would have been three groups (power < 50% is always a failure).

Let’s assume you are a victim of an agency requiring the FDA’s decision scheme ($$\small{\text{I}}$$ → $$\small{\text{II}}$$ or $$\small{\text{III}}$$).

# Cave: extremely long runtime
CV       <- 0.31
theta0   <- 0.95
target   <- 0.80
design   <- "2x2x2"
capacity <- 24
do.rate  <- 0.05
x <- sim.groups(CV = CV, theta0 = theta0, target = target,
design = design, capacity = capacity, do.rate = do.rate,
gmodel = 1)
# Attempting to have at least one group with
# the maximum capacity of the clinical site.
# CV                   : 0.3100
# theta0               : 0.9500
# BE-limits            : 0.8000 – 1.2500
# Target power         : 0.80
# Design               : 2x2x2
# Clinical capacity    : 24
# ——————————————————————————————————————————————
# Total sample size    : 42
# Number of groups     :  2
# Subjects per group   : 24 | 18
# Degrees of freedom   :  40 (model III)
#                         39 (model II)
# ——————————————————————————————————————————————
# Anticip. dropout-rate:  5%
# Adjusted sample size : 46
# Number of groups     :  2
# Subjects per group   : 24 | 22
# Degrees of freedom   :  44 (model III)
#                         43 (model II)
# ——————————————————————————————————————————————
# Achieved power of model III and the decision scheme;
# relative change in power of the decision scheme
# compared to model III:
#
#           n model III decision scheme  change
# Expected 42    0.8113          0.7653 -5.674%
# Adjusted 46    0.8451          0.7924 -6.232%

Note that we simulated no Group-by-Treatment interaction. Nevertheless, in 10% of cases the $$\small{G\times T}$$  test will be significant by pure chance and BE only assessed in the largest group of 24 subjects. Hence, overall we loose power. Even worse so, if that happens, in such a group we have only ≈52.1% power. That’s a recipe for disaster and hardly better than tossing a coin.

## Type I Error

The Type I Error can be assessed by setting $$\small{\theta_0}$$ to one of the BE margins $$\small{\left\{\theta_1,\theta_2\right\}}$$, i.e., assume that the Null hypothesis is true. This can be done either by simulations with the function power.TOST.sds() or exact for $$\small{\text{(II)}}$$ and $$\small{\text{(III)}}$$ with the function power.TOST() (in the latter after adjusting the degrees of freedom as outlined above). We need at least 106 simulations in order to obtain a stable result.

# Cave: really extreme long runtime
CV     <- 0.31
design <- "2x2x2"
theta1 <- 0.80
theta2 <- 1.25
ngrp   <- c(24L, 18L)
grps   <- 2
n      <- sum(ngrp)
gmodel <- 3:1
nsims  <- 1e6
adj    <- c(n, n - (grps - 1))
x      <- data.frame(model = c("III", "II", "Decision scheme"),
df = c(n - 2, n - 2 - (grps - 1), NA),
simulated = NA_real_, exact = NA_real_)
for (i in seq_along(gmodel)) {
x$simulated[i] <- power.TOST.sds(CV = CV, theta0 = theta2, n = n, design = design, grps = grps, ngrp = ngrp, gmodel = gmodel[i], nsims = nsims, progress = FALSE) if (i < 3) { x$exact[i] <- suppressMessages(
power.TOST(CV = CV, theta0 = theta2,
n = adj[i], design = design))
}
}
cat("CV                :", sprintf("%.4f", CV),
"\nBE-limits         :",
sprintf("%.4f – %.4f", theta1, theta2),
"\nDesign            :", design,
"\nTotal sample size :", n,
"\nNumber of groups  :", sprintf("%2.0f", grps),
"\nSubjects per group:", paste(ngrp, collapse = " | "),
"\nNull assessed at  :", sprintf("%.4f\n\n", theta2))
print(x, row.names = FALSE, right = FALSE)
# CV                : 0.3100
# BE-limits         : 0.8000 – 1.2500
# Design            : 2x2x2
# Total sample size : 42
# Number of groups  :  2
# Subjects per group: 24 | 18
# Null assessed at  : 1.2500
#
#  model           df simulated exact
#  III             40 0.049979  0.04999970
#  II              39 0.049947  0.04999953
#  Decision scheme NA 0.062646          NA

Of course, $$\small{\text{(III)}}$$ controls the Type I Error. Due to less degrees of freedom the Type I Error is even slightly lower in $$\small{\text{(II)}}$$. As expected, in the decision scheme the Type I Error is significantly inflated (the limit of the binomal test is 0.05036). A bit provocative: The relative consumer risk increases by ≈25.3%.

# Examples

## Simulations

We can simulate studies in groups and assess them for the $$\small{p(G\times T)}$$ interaction. Naïve pooling of data is valid in the strict sense only if all groups have the same size and the PEs of groups would be identical. Cave: 219 LOC.

# Cave: long runtime (~10 minutes for 1e5 simulations)
sim.GxT <- function(CV, theta0 = 0.95, theta1 = 0.8, theta2 = 1.25,
target = 0.8, groups = 2, capacity, split = c(0.5, 0.5),
mue =  c(0.95, 1 / 0.95), level = 0.1, setseed = TRUE,
nsims = 1e5, progr = FALSE, leg = TRUE, print = FALSE,
details = FALSE) {
require(PowerTOST)
require(shape)
#######################################################
# Performs simulations of the G×T interaction test of #
# 2×2×2 crossover studies.                            #
# Model 1: Group, Sequence, Treatment,                #
#          Subject (nested within Group × Sequence),  #
#          Period (nested within Group),              #
#          Group-by-Sequence Interaction,             #
#          Group-by-Treatment Interaction             #
# ANOVA (all effects fixed)                           #
#######################################################
# CV       : assumed intra-subject CV; can be a two-element vector - in
#            this case the sample size is estimated based on the pooled CV
# theta0   : assumed T/R-ratio (default 0.95)
# theta1   : lower BE-limit (default 0.80)
# theta2   : upper BE-limit (default 1.25)
# target   : targetpower (default 0.80)
# groups   : number of groups (default 2)
# capacity : capacity of clinical site
# split    : group sizes / total sample size (default c(0.5, 0.5))
#            must be a vector, where
#            length(split) == groups & sum(split) == 1
#            Note: May lead to unbalanced sequences within groups!
# mue      : GMRs of groups
#            must be a vector, where length(mue) == groups
#            if all elements are equal: no G×T interaction
# level    : level of the G×T test (default 0.1)
# setseed  : should a fixed seed issued? (default TRUE)
# nsims    : number of simulations (default 1e5)
# progr    : should a progress bar be shown? (default FALSE)
# leg      : should legend in the plot be used? (default TRUE)
# print    : should summary of p(G×T) be shown? (default FALSE)
# details  : should the runtime be shown? (default FALSE)
#######################
# Generate study data #
#######################
group.data <- function(CV = CV, mue = mue, n.group = n.group,
capacity = capacity) {
if (length(n.group) < 2) stop("At least two groups required.")
if (max(n.group) > capacity)
warning("Largest group exceeds capacity of site!")
subject <- rep(1:sum(n.group), each = 2)
group   <- period <- treatment <- sequence <- NULL
for (i in seq_along(n.group)) {
sequence  <- c(sequence, c(rep("TR", n.group[i]),
c(rep("RT", n.group[i]))))
treatment <- c(treatment, rep(c("T", "R"), ceiling(n.group[i] / 2)),
rep(c("R", "T"), floor(n.group[i] / 2)))
period    <- c(period, rep(c(1:2), ceiling(n.group[i] / 2)),
rep(c(1:2), floor(n.group[i] / 2)))
group     <- c(group, rep(i, ceiling(n.group[i])),
rep(i, floor(n.group[i])))
}
data        <- data.frame(subject, group, sequence, treatment, period,
Y = NA_real_)
for (i in seq_along(n.group)) {
if (length(CV) == 1) {# homogenicity
data$Y[data$group == i & data$treatment == "T"] <- exp(mue[i] + rnorm(n = n.group[i], mean = 0, sd = CV2se(CV))) data$Y[data$group == i & data$treatment == "R"] <-
exp(1 + rnorm(n = n.group[i], mean = 0, sd = CV2se(CV)))
}else {              # heterogenicity
data$Y[data$group == i & data$treatment == "T"] <- exp(mue[i] + rnorm(n = n.group[i], mean = 0, sd = CV2se(CV[1]))) data$Y[data$group == i & data$treatment == "R"] <-
exp(1 + rnorm(n = n.group[i], mean = 0, sd = CV2se(CV[2])))
}
}
facs   <- c("subject", "group", "sequence", "treatment", "period")
data[facs] <- lapply(data[facs], factor)
return(data)
}
########################
# Initial computations #
########################
if (length(CV) == 1) {
CVp <- CV
}else {
if (length(CV) == 2) {
CVp <- mse2CV(mean(c(CV2mse(CV[1]), CV2mse(CV[2]))))
}else {
stop ("More than two CVs not supported.")
}
}
x       <- sampleN.TOST(CV = CVp, theta0 = theta0, theta1 = theta1,
theta2 = theta2, design = "2x2x2",
targetpower = target, print = FALSE)
n       <- x[["Sample size"]]
pwr     <- x[["Achieved power"]]
sig     <- 0               # counter of significant GxT interactions
p.GxT   <- numeric(length = nsims)
n.group <- as.integer(n * split)
if (sum(n.group) < n) {   # increase size of last group if necessary
n.group[groups] <- n.group[groups] + n - sum(n.group)
}                         # TODO: Check & correct (add another group?)
if (setseed) set.seed(123456)
rt      <- proc.time()[[3]]
if (progr) pb <- txtProgressBar(style = 3)
###############
# Simulations #
###############
ow      <- options() # safe defaults
options(contrasts = c("contr.treatment", "contr.poly"), digits = 12)
for (sim in 1:nsims) {
data   <- group.data(CV = CV, mue = mue, n.group = n.group,
capacity = capacity)
model1 <- lm(log(Y) ~ group +
sequence +
treatment +
subject %in% (group*sequence) +
period %in% group +
group:sequence +
group:treatment,
data = data)
p.GxT[sim] <- anova(model1)[["group:treatment", "Pr(>F)"]]
if (p.GxT[sim] < level) {# significant GxT interaction
sig <- sig + 1
}
if (progr) setTxtProgressBar(pb, sim / nsims)
}
options(ow)          # restore defaults
rt      <- signif((proc.time()[[3]] - rt) / 60, 3)
if (progr) close(pb)
# Kolmogorov-Smirnov test:
# exact if x < 100 and no ties, approximate otherwise
ks      <- ks.test(x = p.GxT, y = "punif", 0, 1)
################
# Prepare plot #
################
plot.unif <- qqplot(x = qunif(ppoints(nsims)),
y = p.GxT, plot.it = FALSE) # coordinates
main      <- "Model (I), all effects fixed,"
if (length(unique(mue)) == 1) {
main <- paste(main, "no G\u00D7T interaction\n")
}else {
main <- paste(main, "\u201Ctrue\u201D G\u00D7T interaction\n")
}
main <- paste0(main, groups, " groups (",
paste(n.group, collapse=", "), "), ")
if (length(unique(mue)) == 1) {
if (groups == 2) {
main <- paste0(main, "GMR of both groups ",
paste(sprintf("%.4f", mue[1]), collapse=", "), "\n")
}else {
main <- paste0(main, "GMR of all groups ",
paste(sprintf("%.4f", mue[1]), collapse=", "), "\n")
}
}else {
main <- paste0(main, "GMRs of groups ",
paste(sprintf("%.4f", mue), collapse=", "))
if (length(unique(n.group)) == 1) {
main <- paste0(main, "\npooled GMR ", sprintf("%.4f", sqrt(prod(mue))))
}else {
main <- paste0(main, "\nweighted GMR ",
sprintf("%.4f", prod(mue^n.group)^(1/sum(n.group))))
}
}
main <- paste0(main, "\np (G\u00D7T) <", level, " in ",
sprintf("%.2f%%", 100 * sig / nsims), " of ",
formatC(nsims, format = "d", big.mark = ","),
" simulated studies")
if (ks$p.value == 0) { sub <- paste0(ks$method, sprintf(": p <%.2g", .Machine$double.eps)) }else { sub <- paste0(ks$method, sprintf(": p %.4f", ks$p.value)) } ######## # Plot # ######## dev.new(width = 4.6, height = 4.6) op <- par(no.readonly = TRUE) # safe graphics defaults par(pty = "s", cex.main = 0.9, cex.sub = 0.9, font.main = 1, cex.lab = 1, font.main = 1, cex.axis = 0.8, family = "sans") plot(x = c(0, 1), y = c(0, 1), type = "n", axes = FALSE, main = main, sub=sub, xlab = "uniform [0, 1] quantiles", ylab = expression(italic(p) * "(G\u00D7T)")) axis(1) axis(1, at = seq(0, 1, 0.05), tcl = -0.25, labels = FALSE) axis(2, las = 1) axis(2, at = seq(0, 1, 0.05), tcl = -0.25, labels = FALSE) grid() abline(a = 0, b = 1, col="lightgray") # unity line abline(h = level, lty = 2) # level of the G×T test if (leg) { par(family = "mono") legend("topleft", bg = "white", cex = 0.8, x.intersp = 0, box.lty = 0, legend = c(paste(sprintf("%5.2f%%", 100 * CV), " CV"), paste(sprintf("% 1.4f", theta0), "theta0"), paste(sprintf("%5.2f%%", 100 * target), " target power"), paste(sprintf("%5i", n), " sample size"), paste(sprintf("%5.2f%%", 100 * pwr), " power"))) par(family = "sans") } points(plot.unif$x[plot.unif$y < level], plot.unif$y[plot.unif$y < level], col="red", pch = 19, cex = 0.05) points(plot.unif$x[plot.unif$y >= level], plot.unif$y[plot.unif$y >= level], col="blue", pch = 19, cex = 0.05) x <- max(plot.unif$x[plot.unif$y < level]) y <- max(plot.unif$y[plot.unify < level]) Arrows(x, y, x, par("usr")[1], lwd = 1, arr.length = 0.25, arr.width = 0.15, arr.adj = 1, arr.type = "triangle", col = "black", arr.col = "darkgrey") mtext(sprintf("%.4f", sig / nsims), side = 1, line = 0.1, at = sig / nsims, cex = 0.8) box() par(op) # restore graphics defaults if (details) cat("Runtime for", formatC(nsims, format = "d", big.mark = ","), "simulations:", rt, "minutes\n") if (print) round(summary(p.GxT), 6) } Let’s simulate a study with CV 0.335, T ≡ R, two groups, capacity of the clinical site 24, and target ≥ 90% power. Furthermore, we assume that in both groups T ≡ R, i.e., no Group-by-Treatment interaction. sim.GxT(CV = 0.335, theta0 = 1, target = 0.90, capacity = 24, mue = rep(1, 2), leg = FALSE) As expected. Although there is no true Group-by-Treatment interaction, it is detected at approximately the level of test. Hence, these ≈10% of cases are definitely false positives. When following the FDA’s decision scheme, only one group will be assessed for BE by model $$\small{\text{(III)}}$$, compromising power as we have seen above. Power of one group would be only 48.6% and hence, the study falsely considered a failure. Theoretically $$\small{p(G\times T)}$$ should be uniformly distributed with $$\small{\in\left\{0,1\right\}}$$. As confirmed by the Kolmogorov–Smir­nov test, they are. Let’s play the devil’s advocate. A true Group-by-Treatment interaction, where the T/R-ratio in one group is the reciprocal of the other. Since groups are equally sized, in an analysis of pooled data by either model $$\small{\text{(II)}}$$ or $$\small{\text{(III)}}$$ we will estimate T = R. sim.GxT(CV = 0.335, theta0 = 1, target = 0.90, capacity = 24, mue = c(0.95, 1 / 0.95), leg = FALSE) As expected, again. Since there is a true Group-by-Treatment interaction, it is detected in ≈19% of cases. So far, so good. When following the FDA’s decision scheme, the T/R-ratio deviating from our assumptions will be evident because only one group will be assessed by model $$\small{\text{(III)}}$$ for BE. However, in ≈81% of cases the Group-by-Treat­ment interaction will not be detected. That’s due to the poor power of the test. Consequently, the pooled data will be assessed by model $$\small{\text{(II)}}$$ and we will estimate T = R, which is wrong because here we know that groups differ in their treatment effects. In a real study we have no clue. Do I hear a whisper in the back row »Can’t we simply calculate the CIs of the groups and compare them?« You could but please only at home. Since we have extremely low power, very likely the CIs will overlap. Even if not, what would you conclude? That would be yet another pre-test with all its nasty consequences. Welcome to the world of doubts. [The] impatience with ambiguity can be criticized in the phrase: absence of evidence is not evidence of absence. Carl Sagan (1995)12 Not enough? Two groups (38 and 10 subjects), extreme Group-by-Treatment interaction (true T/R-ratio in the first group 1.0605 and the second 0.80, which is the Null). sim.GxT(CV = 0.335, theta0 = 1, target = 0.90, split = c(1-10/48, 10/48), capacity = 40, mue = c(1.0604803726, 0.80), leg = FALSE) Okay… With the FDA’s decision scheme in almost 50% of cases we will evaluate only the large group and grum­pily accept the ≈68% power. Substantially lower than the 90% we hoped for but who cares if the study passes? Wait a minute – what about the second group? Ignore it, right? Of course, there is also a ≈50% chance that nothing suspicious will be detected and the T/R-ratio estimated as 1. No hard feelings! What if not only the T/R-ratios are different but also the CVs? Let’s try CV ≈0.37 in the large group and CV ≈0.30 in the small one. sim.GxT(CV = CVp2CV(0.335, ratio = 1.5), theta0 = 1, target = 0.90, split = c(1-10/48, 10/48), capacity = 40, mue = c(1.0604803726, 0.80), leg = FALSE) Exactly like before because the crossover models assume homoscedasticity. ## Meta-study A small meta-study: 8613 comparative bioavailability studies (BE, food-effect, DDI), crossovers with two to five groups, median sample size 24 (15 – 143), median interval separating the groups three days (one to 18 days), 75 analytes. Assessment of $$\small{\log_{e}\textsf{-}}$$transformed PK metrics with the conventional acceptance range of 80.00 – 125.00%. Not representative, of course. Perhaps I’m a victim of selection bias because regularly I get corpses on my desk to perform a – rarely useful – autopsy. To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of. Ronald A. Fisher (1938)14 ‘Bias’ and ‘precision’ of model $$\small{\text{(II)}}$$ were calculated according to D’Angelo et al.,15 which are not the usual measures of bias and precision (calculated using the expected values of the true and estimated population means) by \eqalign{ \text{bias}&=100\,\frac{w\,(\text{II})-w\,(\text{III})}{w\,(\text{III})}\\ \text{precison}&=100\,\frac{\left|\,w\,(\text{II})-w\,(\text{III})\,\right|}{w\,(\text{III})}\small{,}}\tag{6} where $$\small{w}$$ is the width of the 90% CI obtained by model $$\small{\text{(II)}}$$ and $$\small{\text{(III)}}$$, respectively. As expected, significant Group-by-Treatment interactions were detected at approximately the level of the test. Hence, based on our observations in well-controlled studies likely they are mere ‘statistical artifacts’, i.e., false positives. The Kolmogorov–Smirnov tests are not significant, confirming the expected uniform distribution of $$\small{p(G\times T)}$$. Furthermore, no significant correlations of $$\small{p(G\times T)}$$ with the sample size, number of groups, interval between groups, and sex gender were found. Below a summary of the ‘bias’ and ‘precision’ of model $$\small{\text{(II)}}$$, as well as passing rates by models $$\small{\text{(III)}}$$ and $$\small{\text{(II)}}$$. $\small{\begin{array}{llcccc} \textsf{metric} & \textsf{data sets} & \textsf{bias} & \textsf{precision} & \text{III} & \text{II}\\\hline AUC & \phantom{00}102 & +7.2\% & 11\% & 86.3\% & 77.5\%\\ C_\text{max} & \phantom{00}104 & +3.5\% & 10.6\% & 67.3\% & 58.7\% \end{array}}$ Less studies passed when evaluated by model $$\small{\text{(II)}}$$ than by model $$\small{\text{(III)}}$$. This is not only due to the fewer degrees of freedom but also due to different residual errors ($$\small{\widehat{s^2}}$$), a finding similar to another meta-study,15 where fewer studies passed with the carry­over-term in the model than without. It is not possible to predict whether the additional group-terms by model $$\small{\text{(II)}}$$ can ‘explain’ part of the variability, i.e., its $$\small{\widehat{s^2}}$$ may be smaller or larger than the one of model $$\small{\text{(III)}}$$. $\small{\widehat{s^2}\:\:\:\begin{array}{|lcccccccc} \textsf{metric} & \textsf{model} & \text{minimum} & 2.5\% & \text{Q I} & \text{median} & \text{Q III} & 97.5\% & \text{maximum}\\\hline AUC & \text{III} & 0.0021 & 0.0038 & 0.0133 & 0.0298 & 0.0558 & 0.1585 & 0.3228\\ & \text{ II} & 0.0022 & 0.0039 & 0.013 & 0.0274 & 0.0552 & 0.161 & 0.3072\\ C_\text{max} & \text{III} & 0.0046 & 0.0069 & 0.025 & 0.057 & 0.0862 & 0.2198 & 0.4655\\ & \text{ II} & 0.0048 & 0.0069 & 0.0249 & 0.0541 & 0.0893 & 0.2261 & 0.4513 \end{array}}$ $$\small{\widehat{s^2}}$$ by $$\small{\text{(II)}}$$ was larger than by model $$\small{\text{(III)}}$$ in ≈58% of the AUC data sets and in ≈60% of the Cmax data sets. The larger $$\small{\widehat{s^2}}$$ in the majority of data sets by $$\small{\text{(II)}}$$ is reflected in the lower passing rate. In one of the AUC data sets (18 subjects in each of two groups and 17 in one) $$\small{\widehat{s^2}}$$ increased from 0.07130 by model $$\small{\text{(III)}}$$ to 0.07395 by model $$\small{\text{(II)}}$$. Together with 49 degrees of freedom instead of 51, the upper CL is 125.13% instead of 124.90%. Hence, the study would be considered a failure by model $$\small{\text{(II)}}$$ though it passed by model $$\small{\text{(III)}}$$. On the other hand, model $$\small{\text{(II)}}$$ may recover information indeed. In a Cmax data set (a che­mo­therapeutic in multiple sites: 3, 3, 4, 5, and 14 subjects / site) $$\small{\widehat{s^2}}$$ decreased substantially from 0.2677 by model $$\small{\text{(III)}}$$ to 0.1630 by model $$\small{\text{(II)}}$$. The study would have [sic] passed with an upper CL of 110.49% but was a disaster with 149.03% according to the protocol – ignoring the multi-site structure. Don’t worry, it’s too late. It must be mentioned that some studies were powered for wider acceptance ranges of 75.00 – 133.33% or 70.00 – 142.86% for Cmax, thus clarifying the low passing rates of this PK metric when assessed with the conventional acceptance range. ## Fake Groups AUC data from the literature.16 54 subjects, four period full replicated design. I assessed only the 52 subjects with complete data (excluded #3 and #27). I assigned randomly subjects to two ‘fake groups’ (12 – 40 subjects) and assessed the $$\small{G\times T}$$ test. Below only significant results where the ‘fake groups’ had similar sizes. data <- data.frame(subject = c(rep(1L:2L, each = 4), rep(3L, 2), rep(4L:13L, each = 4), rep(15L:26L, each = 4), rep(27L, 2), rep(28L:40L, each = 4), rep(42L:50L, each = 4), rep(52L:57L, each = 4)), period = c(rep(1L:4L, 2), 1L:2L, rep(1L:4L, 10), rep(1L:4L, 12), 1L:2L, rep(1L:4L, 13), rep(1L:4L, 9), rep(1L:4L, 6)), sequence = c(rep("RTRT", 4), rep("TRTR", 4), rep("RTRT", 2), rep("TRTR", 4), rep("RTRT", 8), rep("TRTR", 12), rep("RTRT", 4), rep("TRTR", 4), rep("RTRT", 4), rep("TRTR", 4), rep("RTRT", 4), rep("TRTR", 4), rep("RTRT", 8), rep("TRTR", 8), rep("RTRT", 4), rep("TRTR", 8), rep("TRTR", 8), rep("TRTR", 6), rep("RTRT", 8), rep("TRTR", 4), rep("RTRT", 4), rep("TRTR", 8), rep("RTRT", 8), rep("TRTR", 12), rep("RTRT", 8), rep("TRTR", 8), rep("RTRT", 4), rep("TRTR", 4), rep("RTRT", 4), rep("TRTR", 4), rep("RTRT", 12), rep("TRTR", 4), rep("RTRT", 8), rep("TRTR", 8), rep("RTRT", 4)), treatment = NA_character_, group = NA_integer_, AUC = c(812.6, 1173.7, 889.1, 620.1, 216.3, 338.0, 502.8, 398.6, 545.1, 542.9, 632.6, 520.0, 716.7, 860.4, 400.0, 223.8, 173.7, 289.7, 102.1, 185.3, 42.0, 88.3, 596.0, 659.3, 543.8, 662.9, 402.4, 359.8, 590.8, 444.3, 456.7, 378.4, 477.5, 407.9, 304.5, 351.5, 520.2, 335.7, 500.7, 323.0, 416.3, 525.1, 176.1, 710.7, 409.5, 645.5, 160.6, 218.0, 170.1, 124.6, 562.4, 490.4, 504.7, 675.9, 756.0, 606.8, 477.4, 626.8, 207.5, 271.6, 173.7, 240.5, 571.3, 705.2, 619.0, 633.6, 511.9, 549.7, 388.2, 141.0, 124.0, 91.9, 113.3, 59.5, 536.1, 595.2, 445.5, 521.5, 239.7, 265.1, 445.9, 433.2, 609.6, 371.6, 511.3, 432.7, 449.9, 860.4, 606.8, 577.2, 192.5, 220.1, 233.1, 227.0, 764.4, 508.8, 757.8, 449.4, 151.9, 194.8, 568.1, 321.1, 338.3, 403.6, 735.6, 634.5, 1244.2, 641.9, 429.1, 391.8, 316.9, 335.1, 307.4, 481.8, 346.6, 369.7, 409.0, 514.6, 763.1, 406.5, 271.0, 221.0, 296.5, 463.7, 292.9, 431.0, 448.5, 267.8, 217.2, 332.2, 103.0, 127.5, 290.8, 208.6, 243.7, 489.7, 297.2, 502.0, 320.4, 334.3, 163.8, 232.1, 636.9, 434.9, 368.3, 292.6, 446.1, 222.3, 193.7, 202.8, 255.2, 244.3, 534.1, 243.1, 418.4, 441.9, 355.1, 415.2, 382.7, 334.0, 102.0, 282.5, 245.6, 286.2, 320.5, 233.9, 331.7, 260.5, 223.6, 645.4, 349.0, 507.4, 504.5, 289.9, 550.7, 244.2, 615.8, 732.1, 620.9, 665.2, 898.4, 924.9, 398.3, 828.3, 410.4, 329.2, 449.4, 442.1, 237.0, 505.0, 496.3, 580.6, 332.4, 273.6, 525.3, 293.3, 185.2, 222.9, 182.1, 194.1, 246.9, 620.9, 678.3, 752.2, 235.4, 190.4, 318.3, 248.4, 180.6, 174.7, 102.9, 117.0)) for (i in 1:nrow(data)) {# extract treatments from sequences and periods treatments <- unlist(strsplit(datasequence[i], split = ""))
data$treatment[i] <- treatments[data$period[i]]
}
subj <- unique(data$subject) keep <- numeric(0) # keep only subjects with complete data for (i in seq_along(subj)) { tmp <- data[data$subject == subj[i], 2]
if (length(data$period[data$subject == subj[i]]) == 4) keep[i] <- subj[i]
}
keep             <- keep[!is.na(keep)]
data             <- data[data$subject %in% keep, ] n.subj <- as.integer(length(unique(data$subject)))
data$subject <- rep(1L:n.subj, each = 4) # easier facs <- c("subject", "sequence", "treatment", "period", "group") ow <- options() options(contrasts = c("contr.treatment", "contr.poly"), digits = 12) nsims <- 2500L res <- data.frame(n.grp1 = rep(NA_integer_, nsims), n.grp2 = NA_integer_, p.GxT = NA_real_, sig = FALSE) set.seed(123456) n.grps <- round(runif(n = nsims, min = 12, max = n.subj - 12)) for (i in 1:nsims) { tmp <- data select <- round(runif(n = n.grps[i], min = 12, max = n.subj - 12)) res$n.grp1[i] <- n.grps[i]
res$n.grp2[i] <- n.subj - res$n.grp1[i]
tmp$group <- 2L tmp$group[tmp$subject %in% select] <- 1L tmp[facs] <- lapply(tmp[facs], factor) model1 <- lm(log(AUC) ~ group + sequence + treatment + subject %in% (group*sequence) + period %in% group + group:sequence + group:treatment, data = tmp) res$p.GxT[i]  <- anova(model1)[["group:treatment", "Pr(>F)"]]
}
options(ow)
res                      <- unique(res)
res$sig[res$p.GxT < 0.1] <- TRUE
res$p.GxT <- round(res$p.GxT, 6)
res                      <- res[with(res, order(p.GxT, n.grp1)), ]
# print(res, row.names = FALSE)
# sum(res$sig) / nsims print(res[res$n.grp1 >= 25 & res$n.grp1 <= 27 & res$n.grp2 >= 25 & res$n.grp2 <= 27 & res$sig == TRUE, ], row.names = FALSE)
#  n.grp1 n.grp2    p.GxT  sig
#      27     25 0.025003 TRUE
#      25     27 0.033413 TRUE
#      27     25 0.058473 TRUE
#      25     27 0.063708 TRUE
#      26     26 0.072094 TRUE
#      27     25 0.076887 TRUE
#      25     27 0.085086 TRUE
#      26     26 0.090428 TRUE
#      27     25 0.091810 TRUE
#      27     25 0.095832 TRUE

Here 5.04% of cases showed a significant Group-by-Treatment interaction.
Recall: The study was not performed in groups! Hit by Murphy.

# Summary

In multi-site studies the model should always be modified. → $$\small{\text{(II)}}$$

• The FDA’s decision scheme based on the above is only given for multi-group studies. When it comes to multi-site studies, how should we interpret »questions may arise as to whether the results from the several groups should be combined in a single analysis«?
Cave: Intra-subject contrasts for the estimation of the treatment effect (and hence, the PE and its CI) cannot be unbiased obtained from (I).10 The G × T  test serves only as a tool in the decision scheme.
• As any pre-test it inflates the family-wise error rate (i.e., the probability α of the Type I Error, the consumer risk ) and hence, its application is statistically flawed from the start. That’s similar to testing for unequal carryover (aka a ‘sequence effect’), which was proved to be false17 (see also this article).
• Even if there is a true Group-by-Treatment, it will not be detected in the majority of cases and hence, application of (II) questionable.
• As shown in the simulations the power of the G × T test is poor. If there is no true Group-by-Treat­ment interaction, it will be falsely detected in 10% of cases, seriously compromising power.
• »[…] it may not be necessary […]« leaves room for interpretation.
• Similar demographics are not relevant in crossover studies, since each subject acts as its own reference. In parallel designs such a requirement is trivial (applicable to single group-/site-studies as well).
• »One of the groups« possibly refers to the largest one. It is unclear how to proceed if there are equally sized groups. Most conservative would be to follow the intersection-union principle18 and require demonstration of BE in all of them.
Bolton and Bon7 state »at least one of the groups«, which must not necessarily be the largest.

Essentially, a statement in the context of the ‘sequence effect’ is applicable here as well.

Testing for carryover in bioequivalence studies […] is not recommended and, moreover, can be harmful. It seems that whenever carry-over is ‘detected’ under such conditions, it is a false positive and researchers will be led to use an inferior estimate, abandoning a superior one.
Senn et al. (2004)19

North American CROs routinely evaluate $$\small{\text{(I)}}$$ but with no consequences if the $$\small{G\times T}$$  test is significant, i.e., for the assessment of BE model $$\small{\text{(II)}}$$ is used. Very few deficiency letters issued by the FDA. Most we have seen so far were received by European CROs. Politics?

All animals are equal, but some animals are more equal than others.
George Orwell (1945)20

Good news from the other side of the pond: The FDA accepted in a ‘Type A’ meeting in April 2021 our arguments, especially about the inflated Type I Error.21 Hence, for our study the pre-test $$\small{\text{(I)}}$$ in the decision scheme is not required and $$\small{\text{(II)}}$$ can be used.

Given the outcome of the small meta-study, significant Group-by-Treatment interactions are most likely false positives. To avoid problems with the FDA, in some MENA-states (especially Saudi Arabia), the EEU, and recently with some Euro­pean assessors, we recommend to opt for model $$\small{\text{(II)}}$$ in any case. The impact on power compared to the conventional model $$\small{\text{(III)}}$$ is acceptable and its application avoids lengthy and fruitless discussions.

In the context of Two-Stage Designs the EMA stated in the Q&A document:22

Discussion
A model which also includes a term for a formulation*stage interaction would give equal weight to the two stages, even if the number of subjects in each stage is very different. The results can be very misleading hence such a model is not considered acceptable. Further­more, this model assumes that the formulation effect is truly different in each stage. If such an assumption were true there is no single formulation effect that can be applied to the general population, and the estimate from the study has no real meaning.

Conclusion
3) A term for a formulation*stage interaction should not be fitted.

EMA (2013)22
(my emphases)

We replaced ‘stage’ by ‘group’ in successfully answering deficiency letters.

In a deficiency letter of 201823 related to a multi-site study of a biosimilar ($$\small{\text{I}}$$ → $$\small{\text{III}}$$) the EMA stated:

• A sub-set of patients cannot be selected for the BE analysis on the basis of tests for a treatment-by-site interaction. It is questioned whether this approach produces an unbiased estimate as the chosen group may no longer be representative of the initial intended study population. The Type I error is not controlled when the procedure […] is only related to the finally selected population/model.
• In a multi-site study Model II is preferred, and that should be used for the BE analysis regardless of the results of any interaction tests.

Interpreting the guideline:4

»Can [it] be reasonably assumed« that a study performed in the same CRO, quite often groups separated by only a couple of days, samples analyzed with the same bioanalytical method in the same lab, in a crossover each subject acting as its own reference would »have an effect on the response variable«?

Of course, not. IMHO, assuming the opposite is an insult to the mind. Therefore, the conventional model $$\small{\text{(III)}}$$ can be used (without a justification). In 40+ years I came across only two cases where model $$\small{\text{(II)}}$$ was requested by a European agency (one multi-group and one multi-site study).

However, never say never…

In a deficiency letter of 2021:

»As the subjects were enrolled and dosed divided in two groups one day apart […] the group effect should be included in the ANOVA model and a new statistical analysis should be performed or otherwise it should be further justified.«
NB, the size of the second group was just 6% of the first. CI of Cmax 87.96–100.93% by model $$\small{\text{(III)}}$$ and 87.88–101.37% by $$\small{\text{(II)}}$$.
Did the assessor expect that the study would suddenly fail, only because there is one degree of freedom less? Or was he/she worried about the one [sic] day separating groups?

An example of 2022:

An FDC product with two APIs, PK metrics AUCτ,ss, Cmax,ss, and Cτ,ss. The study was performed in two groups (31 and 28 eligible subjects); passed BE by $$\small{\text{(III)}}$$ per protocol and would have passed by $$\small{\text{(II)}}$$ as well.

1. A European [sic] agency asked for assessment of the Group-by-Treatment interaction. One of the six tests (Cτ,ss of one API) was significant (p 0.03). Not astonishing with the $$\small{G\times T}$$  test’s false positive rate of 10%…
2. Arguments of the applicant were not accepted. Instead, the agency asked for separate evaluation of groups. All PK metrics passed, except Cτ,ss in the smaller group, which failed by a small margin (78.76%).
Two APIs × three PK metrics × two groups = whopping twelve comparisons…

What’s the point? Of course, the study was powered for $$\small{\text{(III)}}$$ and the applicant was extremely lucky24 that the other PK metrics passed in the separate evaluations with roughly only half the planned sample size in both groups.
As an aside, the same drug was studied in single dose studies (fasting/fed) as well. No significant Group-by-Treatment interaction of Cmax and AUC0–t. Hence, overall there was a mere significant $$\small{G\times T}$$  test out of 14.
Discussions with the agency are ongoing. Outright bizarre. What a waste of time!

To our knowlege only one publication explored different models for multi-group studies.25 While the authors recommend a mixed-effects model, according to guidelines in 2×2×2 crossover studies only complete data should be used. We don’t agree with the authors that ‘group’ should be treated as a random effect, unless an additional group was recruited to counteract the loss in power due to dropouts.

# Recommendations

Like in testing for unequal carryover (aka the ‘sequence effect’, see this article) a statistical ‘correction’ of a true Group-by-Treatment interaction is not possible. One can only try to avoid problems by design.

• The FDA’s decision scheme (I) → (II) or (III) is not appropriate and hence, should be avoided.
• Due to the low power of the test, in the majority of cases a true Group-by-Treatment interaction will not be detected.
• Power is compromised and – more important – the Type I Error inflated.
• If a study is performed in multiple groups, model (II) is recommended, except in the EEU, where it is mandatory.
• If a study is performed in multiple sites, model (II) is mandatory.
• In-/exclusion criteria should be identical across sites.

In either case:

• All enrolled subjects should be randomly assigned to treatments at study outset.
• Samples should be analyzed by the same bioanalytical method in the same lab.
• The model must be unambiguously stated in the protocol.

### Acknowledgment

The 17 companies and individuals of twelve countries for providing data for the meta-analysis.

Anders Fuglsang (Fuglsang Pharma, Haderslev, Denmark), Detlew Labes (CCDRD, Berlin, Germany), Anastasia Shitova (Quinta Ana­ly­tica, Yaroslavl, Russian Federation), Michael Tomashevskiy (OnTarget Group, Saint Pe­ters­burg, Russian Federation), and Volodymyr Stus (Polpharma, Gdańsk, Poland), as well as user ‘zan’ of the BEBA-Forum for fruitful discussions.

### Postscript

We are still collecting data for the meta-study. The preferred file format is CSV (though XLS(X), ODS, XPT, or Phoenix project files would do as well). Of course, data will be treated strictly confidential and not published. Please send data to [email protected].

Columns:

1. Company (text)
2. Study code (text)
3. Analyte (text)
4. Drug (integer) 1 … number of analytes
5. Subject (integer or text) min(n) … max(n); ‘holes’ due to dropouts not a problem
6. Group or Site (integer) 1 … number of groups / sites
7. Sequence (character or integer)
8. Treatment (character) T or R
9. Period (integer) 1 or 2
10. AUC (numeric); for single dose AUC0–t or AUC0–72, for multiple dose AUC0–τ.
Missing values should be coded with NA.
11. Cmax (numeric)

Optional:

1. n (integer) total sample site
2. Interval (integer) if applicable, days separating groups
3. Sex (character) f or m

No cherry-picking, otherwise we will fall into the trap of selection bias. Hence, if you decide to provide data, please do so irrespective of whether you ‘detected’ a significant Group-by-Treatment interaction or not. We are primarily working on 2×2×2 crossover designs. However, if you have data of replicate designs, fine as well.
So far we have only one multi-site study. If you could share some data, great.

Helmut Schütz 2022
R and PowerTOST GPL 3.0, pandoc GPL 2.0.
1st version July 29, 2022. Rendered August 12, 2022 23:53 CEST by rmarkdown via pandoc in 0.64 seconds.

### Footnotes and References

1. Labes D, Schütz H, Lang B. PowerTOST: Power and Sample Size for (Bio)Equivalence Studies. Package version 1.5.4. 2022-02-21. CRAN.↩︎

2. Soetaert K. shape: Functions for Plotting Graphical Shapes, Colors. Package version 1.4.6. 2021-05-19. CRAN.↩︎

3. FDA, CDER. Guidance for Industry. Statistical Approaches to Establishing Bioequivalence. Rockville. January 2001. Download.↩︎

4. EMA, CHMP. Guideline on the Investigation of Bioequivalence. CPMP/EWP/QWP/1401/98 Rev. 1/ Corr. London. 20 January 2010. Online.↩︎

5. Council of the EEC. On the approval of the rules for conducting research of bioequivalence of drugs within the framework of the EEU. November 3, 2016; amended September 4, 2020.↩︎

6. Statisticians may be more familiar with the terminology $$\small{\textrm{AB}|\textrm{BA}}$$.
Forgive me – in bioequivalence $$\small{\textrm{TR}|\textrm{RT}}$$ is commonly used, where $$\small{\textrm{T}}$$ denotes the Test and $$\small{\textrm{R}}$$ the Reference treatment (formulation).↩︎

7. Bolton S, Bon C. Pharmaceutical Statistics. Practical and Clinical Applications. New York: informa healthcare; 5th edition 2009. p. 629.↩︎

8. If Subject 1 is randomized to sequence $$\small{\text{TR}}$$, there is not ‘another’ Subject 1 randomized to sequence $$\small{\text{RT}}$$. Randomization is not like Schrödinger’s cat. Hence, the nested term in the guidelines is an insult to the mind.↩︎

9. FDA, CDER. ANDA 077570. Bioequivalence Reviews. 2008. Control document #98-392 regarding the Group-by-Treatment interaction discussion. September 10, 1999. Online.↩︎

10. WHO. Frequent deficiencies in BE study protocols. Geneva. November 2020. Online.↩︎

11. Quoting my late father: »If you believe, go to church.«↩︎

12. Sagan C. The Demon-Haunted World: Science as a Candle in the Dark. New York: Random House; 1995. Ch. 12. The Fine Art of Baloney Detection.↩︎

13. Why so few? If ever possible my clients try to avoid multi-group and multi-site studies – and for a reason. More data sets than studies because some where FDC products.↩︎

14. Fisher RA. Presidential Address to the First Indian Statistical Congress. Sankhya. 1938; 4: 14–17.↩︎

15. D’Angelo G, Potvin D, Turgeon J. Carry-over effects in bioequivalence studies. J Biopharm Stat. 2001; 11(1&2): 35–43. doi:10.1081/bip-100104196.↩︎

16. Patterson SD, Jones B. Bioequivalence and statistics in clinical pharmacology. Boca Raton: CRC Press; 2nd edition 2017. Table 4.30. pp. 105–6.↩︎

17. Freeman PR. The performance of the two-stage analysis of two-treatment, two-period crossover trials. Stat Med. 1989; 8(12): 1421–32. doi:10.1002/sim.4780081202.↩︎

18. Berger RL, Hsu JC. Bioequivalence Trials, Intersection-Union Tests and Equivalence Confidence Sets. Stat Sci. 1996; 11(4): 283–302. JSTOR:2246021.↩︎

19. Senn S, D’Angelo G, Potvin D. Carry-over in cross-over trials in bioequivalence: theoretical concerns and empirical evidence. Pharm Stat. 2004; 3: 133–142. doi:10.1002/pst.111.↩︎

20. Orwell G. Animal Farm. A Fairy Story. London: Secker and Warburg; 1945.↩︎

21. Schütz H. Multi-Group Studies in Bioequivalence. To pool or not to pool? Presentation at: BioBridges. Prague. 26–27 September 2018. Online.↩︎

22. EMA, CHMP. Questions & Answers: positions on specific questions addressed to the Pharmacokinetics Working Party (PKWP). EMA/618604/2008 Rev. 13. London. 19 November 2015 (originally published in Rev. 7. 13 February 2013). Online.↩︎

23. Before you ask: Sorry folks, that’s confidential. 😎 ↩︎

24. T/R-ratios and CVs were ‘better’ than assumed, as well as the dropout-rate lower than anticipated in the sample size estimation.↩︎

25. Bae K-S, Kang S-H. Bioequivalence data analysis for the case of separate hospitalization. Transl Clin Phar­ma­col. 2017; 25(2): 93–100. doi:10.12793/tcp.2017.25.2.93. ↩︎